# Reading CSV Files with Pandas - Complete Guide

This notebook demonstrates various techniques and best practices for reading and manipulating CSV files using the pandas library. 

## Table of Contents:
1. **Basic Imports** - Import required libraries
2. **Reading from URLs** - Loading CSV files from remote sources and saving locally
3. **Handling Column Headers** - Different ways to manage header rows
4. **Working with TSV Files** - Tab-separated values and custom delimiters
5. **Selecting Specific Columns** - Efficient data loading with column filtering
6. **Skipping Rows** - Removing unwanted rows during import
7. **Row Limits** - Reading partial datasets
8. **Encoding Issues** - Handling character encoding problems
9. **Error Handling** - Dealing with malformed data and missing values
10. **Data Type Optimization** - Reducing memory usage with dtype specifications
11. **Data Transformation** - Converting values during import
12. **Processing Large Files** - Chunking strategy for memory efficiency

---

In [None]:
# Import essential libraries for data manipulation and visualization
import pandas as pd  # For reading and manipulating CSV files
import matplotlib.pyplot as plt  # For creating visualizations

## Section 1: Reading CSV Files from URLs

You can directly read CSV files from remote URLs without downloading them first. This is useful for accessing public datasets hosted on repositories like GitHub.

**Key Parameters:**
- `index_col` - Specifies which column(s) to use as the index
- `to_csv()` - Method to save the dataframe to a local CSV file
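- **Note:** An `Unnamed: 0` column appears when the first header field in the file is empty. This typically happens when a DataFrame was saved with `to_csv()` without `index=False`, so the old index was written as an unlabeled column. Passing that column to `index_col` turns it back into the index.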
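
The origin of the `Unnamed: 0` column can be reproduced without any network access. The snippet below is a minimal sketch that uses a tiny made-up DataFrame (not the placement dataset) and an in-memory buffer in place of a file.

```python
import io
import pandas as pd

# Hypothetical sample data, used only for illustration.
original = pd.DataFrame({"cgpa": [6.8, 5.9], "placement": [1, 0]})

# to_csv() writes the index by default, under an empty header field.
csv_text = original.to_csv()

# Read naively: the empty header field is labelled 'Unnamed: 0'.
naive = pd.read_csv(io.StringIO(csv_text))
print(list(naive.columns))

# index_col=0 uses the first column as the index instead of as data.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(restored.columns))
```

Using the position (`index_col=0`) rather than the label avoids depending on the generated name.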

In [None]:
# Read CSV from a GitHub URL and save it locally
import pandas as pd

# URL pointing to a CSV file hosted on GitHub raw content
url = "https://raw.githubusercontent.com/Light200312/ML-models/refs/heads/main/Logistic%20Classification%20Regression%20Model/placement.csv"

# Read CSV from URL using index_col parameter
# index_col=['Unnamed: 0'] treats the unnamed column as the index instead of data
# This is commonly needed when CSV files have an implicit index column
df = pd.read_csv(url, index_col=['Unnamed: 0'])

# Save the loaded dataframe to a local CSV file for later use
# index=False prevents pandas from writing row indices to the file
df.to_csv("placements.csv", index=False)

In [20]:
df

Unnamed: 0,cgpa,iq,placement
0,6.8,123.0,1
1,5.9,106.0,0
2,5.3,121.0,0
3,7.4,132.0,1
4,5.8,142.0,0
...,...,...,...
95,4.3,200.0,0
96,4.4,42.0,0
97,6.7,182.0,1
98,6.3,103.0,1


## Section 2: Understanding Column Headers

The `header` parameter defines which row is used as column names:
- `header=0` (default) - First row is treated as headers
- `header=1` - Second row is treated as headers; the first row is discarded, not kept as data
- `header=None` - No header row; columns are named 0, 1, 2, etc.
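
The three settings can be compared side by side on a small in-memory file. This is a sketch with made-up contents: a title line followed by the real header line and two data rows.

```python
import io
import pandas as pd

# Hypothetical file: line 0 is a title, line 1 holds the real column names.
raw = "Report,2024\nname,score\nAsha,90\nRavi,85\n"

# header=0 (default): the title line becomes the column names.
by_default = pd.read_csv(io.StringIO(raw), header=0)

# header=1: the second line becomes the column names; the title line is dropped.
second_row = pd.read_csv(io.StringIO(raw), header=1)

# header=None: every line is data and columns are numbered 0, 1, ...
no_header = pd.read_csv(io.StringIO(raw), header=None)

for frame in (by_default, second_row, no_header):
    print(list(frame.columns), len(frame))
```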

In [None]:
# Example: Reading a CSV file with explicit header parameter
# df2 = pd.read_csv("top_250_imdb_dataset.csv", header=1)  # Use second row as headers

# Using default header=0 - the first row becomes column names
df2 = pd.read_csv("top_250_imdb_dataset.csv", header=0)

# Display first 5 rows to verify the headers and data structure
df2.head(5)

Unnamed: 0,Rank,Title,Year,Rating,Duration_Minutes,Certificate,Source
0,1,The Shawshank Redemption,1994,,142,R,IMDb Top 250
1,2,The Godfather,1972,,175,R,IMDb Top 250
2,3,The Dark Knight,2008,,152,PG-13,IMDb Top 250
3,4,The Godfather Part II,1974,,202,R,IMDb Top 250
4,5,12 Angry Men,1957,,96,Approved,IMDb Top 250


In [None]:
# Example: Reading a TSV (Tab-Separated Values) file

# sep="\t" indicates the file uses tabs as delimiters instead of commas
# names parameter assigns custom column names to the dataframe
# When names is provided, the first data row is not interpreted as headers
df3 = pd.read_csv("file.tsv", sep="\t", names=['col0', "col1", "col2", "col3"])

# Display first 5 rows to verify successful parsing
df3.head(5)

Unnamed: 0,col0,col1,col2,col3
0,0,50,5,881250949
1,0,172,5,881250949
2,0,133,1,881250949
3,196,242,3,881250949
4,186,302,3,891717742


## Section 3: Working with TSV (Tab-Separated Values) and Custom Delimiters

TSV files use tabs instead of commas as delimiters. You can handle any delimiter with the `sep` parameter.

**Key Parameters:**
- `sep="\t"` - Specifies tab as the delimiter (use "," for comma, ";" for semicolon, etc.)
- `names` - Provide custom column names instead of reading them from the file
- When `names` is specified, the first data row becomes row 0 instead of being treated as headers
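
The effect of `names` on a headerless file can be sketched with two tab-separated lines held in memory. The column names below are illustrative assumptions, not names taken from `file.tsv`.

```python
import io
import pandas as pd

# Two tab-separated lines with no header row.
raw = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

# Hypothetical column names chosen for this example.
cols = ["user_id", "item_id", "rating", "timestamp"]

# With names, both lines are kept as data.
with_names = pd.read_csv(io.StringIO(raw), sep="\t", names=cols)

# Without names, the first line is consumed as the header.
without_names = pd.read_csv(io.StringIO(raw), sep="\t")

print(with_names.shape, list(with_names.columns))
print(without_names.shape, list(without_names.columns))
```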

In [None]:
# Example: Load only specific columns from a CSV file
# The usecols parameter accepts a list of column names to load
# This is efficient for large files where you only need certain columns
df4 = pd.read_csv("top_250_imdb_dataset.csv", usecols=['Title', 'Year'])

# Display first 5 rows - notice only the specified columns are loaded
df4.head(5)

Unnamed: 0,Title,Year
0,The Shawshank Redemption,1994
1,The Godfather,1972
2,The Dark Knight,2008
3,The Godfather Part II,1974
4,12 Angry Men,1957


## Section 4: Selecting Specific Columns with `usecols`

When dealing with large CSV files, loading only the columns you need can significantly reduce memory usage and improve performance.

**Benefits:**
- **Memory Efficiency** - Only loads specified columns into memory
- **Speed** - Reduces I/O operations and parsing time
- **Code Clarity** - Makes data requirements explicit
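
As a rough illustration of the memory effect, the sketch below reads the same small in-memory table with and without `usecols` and compares the footprint. It also shows the callable form of `usecols`, which keeps every column for which the function returns `True`.

```python
import io
import pandas as pd

raw = (
    "Rank,Title,Year,Source\n"
    "1,The Shawshank Redemption,1994,IMDb Top 250\n"
    "2,The Godfather,1972,IMDb Top 250\n"
)

full = pd.read_csv(io.StringIO(raw))

# List form: load only the named columns.
subset = pd.read_csv(io.StringIO(raw), usecols=["Title", "Year"])

# Callable form: keep every column except 'Source'.
no_source = pd.read_csv(io.StringIO(raw), usecols=lambda name: name != "Source")

# deep=True counts the bytes held by string values as well.
print(full.memory_usage(deep=True).sum(), subset.memory_usage(deep=True).sum())
```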

In [None]:
# Example: Skip specific rows while reading a TSV file
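# skiprows=[0, 2] skips the lines at zero-based positions 0 and 2 (the first and third lines)
# This is useful when a file contains metadata or formatting lines to ignore
# Note: file.tsv has no header row, so the first line that is NOT skipped
# is used as the column names (see the output below)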
pd.read_csv("file.tsv", sep="\t", skiprows=[0, 2])

Unnamed: 0,0,172,5,881250949
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
...,...,...,...,...
99995,880,476,3,880175444
99996,716,204,5,879795543
99997,276,1090,1,874795795
99998,13,225,2,882399156


## Section 5: Skipping Rows During Import

Use `skiprows` to skip specific rows or ranges of rows during import. This is useful when:
- CSV files have header information or comments at the beginning
- Multiple rows need to be ignored before the actual data starts
- You want to skip rows with metadata or formatting
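
`skiprows` accepts an integer, a list of line positions, or a callable. The sketch below tries all three on a made-up file that starts with two comment lines; line positions are zero-based and count from the top of the file.

```python
import io
import pandas as pd

# Hypothetical file: two metadata lines, then a header and three data rows.
raw = "# exported by tool\n# units: runs\nteam,score\nMI,192\nCSK,188\nRCB,205\n"

# Integer: skip the first 2 lines.
skip_n = pd.read_csv(io.StringIO(raw), skiprows=2)

# List: skip lines 0 and 1 (metadata) and line 4 (the CSK row).
skip_list = pd.read_csv(io.StringIO(raw), skiprows=[0, 1, 4])

# Callable: receives each line position and returns True to skip it.
skip_fn = pd.read_csv(io.StringIO(raw), skiprows=lambda i: i < 2)

print(skip_n["team"].tolist())
print(skip_list["team"].tolist())
print(skip_fn.equals(skip_n))
```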

In [None]:
# Example: Read only the first 5 rows of a CSV file
# nrows=5 limits the import to 5 data rows (the header row is not counted)
# Useful for quick previews of large datasets
pd.read_csv("top_250_imdb_dataset.csv", nrows=5)

Unnamed: 0,Rank,Title,Year,Rating,Duration_Minutes,Certificate,Source
0,1,The Shawshank Redemption,1994,,142,R,IMDb Top 250
1,2,The Godfather,1972,,175,R,IMDb Top 250
2,3,The Dark Knight,2008,,152,PG-13,IMDb Top 250
3,4,The Godfather Part II,1974,,202,R,IMDb Top 250
4,5,12 Angry Men,1957,,96,Approved,IMDb Top 250


## Section 6: Reading a Limited Number of Rows with `nrows`

The `nrows` parameter limits the number of rows to be read from the CSV file. This is beneficial for:
- **Testing** - Quickly preview a small sample of large files
- **Memory Management** - Load manageable chunks for initial exploration
- **Development** - Speed up code development iteration
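
A small sketch of `nrows` on generated data follows. It also combines `nrows` with `skiprows` to read a window from the middle of a file, which is one possible way to sample without loading everything.

```python
import io
import pandas as pd

# Hypothetical data: a header line followed by 1,000 numbered rows.
raw = "n\n" + "".join(f"{i}\n" for i in range(1000))

# Read only the first 5 data rows; the header is not counted.
preview = pd.read_csv(io.StringIO(raw), nrows=5)
print(preview.shape)

# Skip lines 1..100 (the first 100 data rows) but keep the header on line 0,
# then read the next 5 rows.
window = pd.read_csv(io.StringIO(raw), skiprows=range(1, 101), nrows=5)
print(window["n"].tolist())
```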

## Section 7: Handling File Encoding Issues

Character encoding problems occur when a file's actual encoding differs from the one pandas uses to decode it (UTF-8 by default). Common encodings include UTF-8, Latin-1/ISO-8859-1, UTF-16, and CP1252.

### How to Find a File's Encoding:

1. **Using a Text Editor** - Open the file in Notepad++ and check the Encoding menu to see the current format

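2. **Brute-Force over Python's Codecs** - Try every encoding Python knows and print the ones that read the file without raising an error. A successful read is only a candidate, not proof: `latin1`, for example, can decode any byte sequence, so inspect the resulting text.
```python
import pandas as pd
from encodings.aliases import aliases

# Every distinct codec name that Python registers an alias for
alias_values = set(aliases.values())

for encoding in sorted(alias_values):
    try:
        pd.read_csv("test.csv", encoding=encoding)
        print('successful', encoding)
    except Exception:
        # This codec cannot decode the file (or is not a text encoding)
        pass
```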

3. **Try Common Encodings** - UTF-8, Latin-1, and CP1252 cover most cases

### Solution:
```python
pd.read_csv("file.csv", encoding='latin1')  # Specify encoding if UTF-8 fails
```
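
One way to apply the "try common encodings" advice is a small fallback helper. The sketch below is an assumption-laden illustration: `read_csv_with_fallback` is a hypothetical helper (not part of pandas), and the sample bytes are made up and encoded as CP1252 so that the UTF-8 attempt fails.

```python
import io
import pandas as pd

# Hypothetical sample: accented characters encoded as CP1252, not UTF-8.
payload = "name,city\nJosé,Málaga\n".encode("cp1252")

def read_csv_with_fallback(data, encodings=("utf-8", "cp1252", "latin1")):
    """Try each encoding in order; return (DataFrame, encoding that worked)."""
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(data), encoding=enc), enc
        except UnicodeError:
            # Decoding failed with this encoding; try the next candidate.
            continue
    raise ValueError("none of the candidate encodings could decode the data")

df_enc, used = read_csv_with_fallback(payload)
print(used)
print(df_enc)
```

Order matters here: `latin1` never fails to decode, so it only makes sense as the last candidate.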

In [None]:
# Default Encoding and Error Handling
# ====================================

# 1. DEFAULT ENCODING:
#    Pandas uses 'utf-8' by default when reading CSV files.
#    If you get encoding-related errors, try specifying encoding explicitly:
#    pd.read_csv("file.csv", encoding='latin1')

# 2. COMMON ENCODINGS:
#    - 'utf-8' (most common, default)
#    - 'latin1' or 'iso-8859-1' (Western European)
#    - 'cp1252' (Windows Western)
#    - 'utf-16' (Unicode)

# 3. HOW TO DETECT ENCODING:

# METHOD A: Using a text editor like Notepad++
#   - Open file → Check Encoding menu

# METHOD B: Using Python's chardet library
#   import chardet
#   with open("file.csv", 'rb') as f:
#       result = chardet.detect(f.read())
#       print(result)  # Returns detected encoding and confidence

In [None]:
# Example: Reading a CSV with error handling and date parsing

# Read a CSV file from a remote URL with error handling and date conversion
# Parameters:
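#   - sep='\t': File uses tab delimiter
#   - on_bad_lines='skip': Drop lines with too many fields instead of raising an error
#     (available in pandas >= 1.3)
#   - parse_dates=['date_inclusion']: Try to convert this column to datetime type;
#     if any value cannot be parsed (e.g. 1999-13-01 in the output below),
#     pandas leaves the column unchanged as object dtype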
df5 = pd.read_csv(
    "https://raw.githubusercontent.com/alimanfoo/csvvalidator/master/example-data-bad.csv",
    sep='\t',
    on_bad_lines='skip',
    parse_dates=['date_inclusion']
)

# Display the data
print(df5)

# Get detailed information about the dataframe structure and data types
df5.info()

   study_id patient_id gender age_years age_months date_inclusion    x
0         x          4      F         2         27     2011-01-01  NaN
1         1          x      F         2         25     2011-01-01  NaN
2         1          1      x         2         27     2011-01-01  NaN
3         1          2      M         x         61     2011-01-01  NaN
4         1          3      F         9          x     2011-01-01  NaN
5         1          3      M         1         17     2011-01-01  NaN
6         1          4      M         2         25     2011-01-01    x
7         2          1      M       200         32     2011-01-01  NaN
8         2          2      M         3         24     2011-01-01  NaN
9         2          3      F         7         90     2011-01-01  NaN
10        2          4      F         1         14     2011-01-01  NaN
11        2          5      F         2         25     1999-13-01  NaN
12        2          6      M         2         25     1999-12-32  NaN
<class

## Section 8: Error Handling - Dealing with Bad Lines and Missing Values

Corrupted CSV files may contain lines with:
- Different numbers of columns than expected
- Invalid data format
- Missing values that need special handling

**Key Parameters:**
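- `on_bad_lines='skip'` - Skip lines with too many fields instead of raising an error (the default is `'error'`; `'warn'` skips them and emits a warning)
- `parse_dates` - Try to convert the specified columns to datetime format; a column containing unparseable values is returned unchanged as object dtype
- `error_bad_lines` - Legacy parameter for skipping bad lines; deprecated in pandas 1.3 and removed in pandas 2.0 in favor of `on_bad_lines`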
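
The difference between a structurally bad line and a bad value can be sketched on made-up in-memory data, assuming pandas 1.3 or newer for `on_bad_lines`. A line with an extra field is a bad line; an impossible date has the right number of fields, so it is handled separately, here with `pd.to_datetime(..., errors="coerce")`.

```python
import io
import pandas as pd

# Hypothetical data for illustration.
raw = (
    "id,date_inclusion,age\n"
    "1,2011-01-01,27\n"
    "2,2011-01-02,25,EXTRA\n"  # too many fields: a bad line
    "3,1999-13-01,61\n"        # correct field count, but month 13 is invalid
)

# With the default on_bad_lines='error', the extra field raises ParserError.
strict_failed = False
try:
    pd.read_csv(io.StringIO(raw))
except pd.errors.ParserError:
    strict_failed = True

# 'skip' drops the malformed line and keeps the rest.
skipped = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")

# Convert dates explicitly; values that do not match the format become NaT.
skipped["date_inclusion"] = pd.to_datetime(
    skipped["date_inclusion"], format="%Y-%m-%d", errors="coerce"
)
print(strict_failed)
print(skipped)
```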

In [None]:
# Example: Set column data types explicitly with dtype
# The dtype parameter accepts a dictionary mapping column names to data types
# This tells pandas exactly which type to use, skipping type inference for those columns
# Note: float here means float64 (8 bytes per value), so this call shows the mechanism only;
# to actually reduce memory, choose a narrower type such as 'float32' or 'int16'
pd.read_csv(
    "top_250_imdb_dataset.csv",
    dtype={'Year': float, 'Duration_Minutes': float}
)

Unnamed: 0,Rank,Title,Year,Rating,Duration_Minutes,Certificate,Source
0,1,The Shawshank Redemption,1994.0,,142.0,R,IMDb Top 250
1,2,The Godfather,1972.0,,175.0,R,IMDb Top 250
2,3,The Dark Knight,2008.0,,152.0,PG-13,IMDb Top 250
3,4,The Godfather Part II,1974.0,,202.0,R,IMDb Top 250
4,5,12 Angry Men,1957.0,,96.0,Approved,IMDb Top 250
...,...,...,...,...,...,...,...
245,246,To Be or Not to Be,1942.0,,99.0,Approved,IMDb Top 250
246,247,The Grapes of Wrath,1940.0,,129.0,Approved,IMDb Top 250
247,248,Gangs of Wasseypur,2012.0,,321.0,Not Rated,IMDb Top 250
248,249,Drishyam,2015.0,,163.0,Not Rated,IMDb Top 250


## Section 9: Data Type Optimization with `dtype` Parameter

By specifying data types while reading, you can:
- **Reduce Memory Usage** - Different numeric types use different amounts of memory (int32 vs int64, float32 vs float64)
- **Improve Performance** - Reduces time spent on type inference
- **Ensure Data Integrity** - Explicitly declares expected data types

**Example Memory Savings:**
- int32 uses 4 bytes vs int64 uses 8 bytes
- float32 uses 4 bytes vs float64 uses 8 bytes
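
The savings can be measured with `DataFrame.memory_usage()`. The sketch below uses generated data (an assumption, not the IMDb file) whose values fit comfortably in `int16`, which uses 2 bytes per value.

```python
import io
import pandas as pd

# Hypothetical data: 10,000 rows of small integers.
raw = "year,duration\n" + "".join(
    f"{1900 + i % 120},{60 + i % 200}\n" for i in range(10_000)
)

# Default inference picks int64 (8 bytes per value).
default = pd.read_csv(io.StringIO(raw))

# Explicit narrower types; safe here because all values are below 32,767.
compact = pd.read_csv(io.StringIO(raw), dtype={"year": "int16", "duration": "int16"})

default_bytes = default.memory_usage(index=False).sum()
compact_bytes = compact.memory_usage(index=False).sum()
print(default_bytes, compact_bytes)
```

Narrow integer types overflow silently in later arithmetic, so the chosen type should cover the full range of values the column can take.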

In [None]:
# Create a sample CSV file with Cricket T20 match data
import pandas as pd

# Define CSV data as a multi-line string
csv_data = """team,score,wickets,overs
Mumbai Indians,192,5,20
Chennai Super Kings,188,3,20
Royal Challengers Bangalore,205,8,20
Kolkata Knight Riders,160,10,18.4
Gujarat Titans,172,4,20
"""

# Write the data to a local CSV file
with open('t20_scores.csv', 'w') as f:
    f.write(csv_data)

print("File 't20_scores.csv' is ready!")

File 't20_scores.csv' is ready!


## Section 10: Creating Sample Data and Value Transformation

This section demonstrates:
1. Creating a CSV file programmatically with sample data
2. Using converter functions to transform column values during import

**Key Concept:** The `converters` parameter applies functions to column values as they're being read, enabling data transformation at import time.
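
A minimal sketch of the idea, using a dictionary lookup and in-memory data. The `ABBREVIATIONS` mapping and the "Unlisted Team" row are illustrative assumptions; the converter receives each raw cell as a string and its return value is stored in the column.

```python
import io
import pandas as pd

# Hypothetical lookup table for this example.
ABBREVIATIONS = {"Mumbai Indians": "MI", "Chennai Super Kings": "CSK"}

raw = "team,score\nMumbai Indians,192\nChennai Super Kings,188\nUnlisted Team,170\n"

# The function runs once per cell in the 'team' column while the file is parsed.
teams = pd.read_csv(
    io.StringIO(raw),
    converters={"team": lambda name: ABBREVIATIONS.get(name, "unknown")},
)
print(teams)
```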

In [None]:
# Define a converter function to transform team names to abbreviations
# This function will be applied to the 'team' column during CSV reading
def convert_team_name(team):
    """
    Convert full team names to their official cricket abbreviations.
    
    Args:
        team (str): Full team name as it appears in the CSV
        
    Returns:
        str: Abbreviated team name (2-3 letters) or 'unknown' if not recognized
    """
    if team == "Mumbai Indians":
        return "MI"
    elif team == "Chennai Super Kings":
        return "CSK"
    elif team == "Royal Challengers Bangalore":
        return "RCB"
    elif team == "Kolkata Knight Riders":
        return "KKR"
    elif team == "Gujarat Titans":
        return "GT"
    else:
        return "unknown"

In [47]:
# Alternatively, read the file first and replace the team names afterwards,
# e.g. with Series.map() or Series.replace()
pd.read_csv("t20_scores.csv", converters={'team': convert_team_name})

Unnamed: 0,team,score,wickets,overs
0,MI,192,5,20.0
1,CSK,188,3,20.0
2,RCB,205,8,20.0
3,KKR,160,10,18.4
4,GT,172,4,20.0


In [None]:
# Treat additional strings as missing values (NaN) while reading a CSV file
# Here 'NA', '?', and '-' are all read as NaN
pd.read_csv("file.csv", na_values=['NA', '?', '-'])
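
The effect of `na_values` can be checked on a small in-memory sample (made-up data). Without it, the placeholder strings keep the `score` column from being numeric; with it, they become `NaN` and the column is parsed as numbers. Note that `'NA'` is already in pandas' default list of missing-value markers.

```python
import io
import pandas as pd

# Hypothetical data with placeholder strings for missing values.
raw = "team,score,wickets\nMI,192,5\nCSK,?,3\nRCB,-,NA\n"

# Default: '?' and '-' are ordinary strings, so 'score' is not numeric.
as_text = pd.read_csv(io.StringIO(raw))

# With na_values: the placeholders become NaN and 'score' is numeric.
cleaned = pd.read_csv(io.StringIO(raw), na_values=["NA", "?", "-"])

print(as_text.dtypes)
print(cleaned.dtypes)
print(cleaned.isna().sum())
```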

In [51]:
# Handle a big file in chunks
chunksize = 10000  # number of rows per chunk
for chunk in pd.read_csv("Earthquakes_indonasia.csv", chunksize=chunksize):
    # process each chunk here
    print(chunk.shape)
    print(chunk.head(2))

(10000, 13)
          tgl            ot   lat     lon  depth  mag  \
0  2008/11/01  21:02:43.058 -9.18  119.06     10  4.9   
1  2008/11/01  20:58:50.248 -6.55  129.64     10  4.6   

                     remark  strike1  dip1  rake1  strike2  dip2  rake2  
0  Sumba Region - Indonesia      NaN   NaN    NaN      NaN   NaN    NaN  
1                 Banda Sea      NaN   NaN    NaN      NaN   NaN    NaN  
(10000, 13)
              tgl            ot   lat     lon  depth  mag  \
10000  2011/10/25  18:04:58.525 -3.14  140.71     10  2.7   
10001  2011/10/25  17:52:02.625 -8.89  121.48     42  2.3   

                          remark  strike1  dip1  rake1  strike2  dip2  rake2  
10000     Irian Jaya - Indonesia      NaN   NaN    NaN      NaN   NaN    NaN  
10001  Flores Region - Indonesia      NaN   NaN    NaN      NaN   NaN    NaN  
(10000, 13)
              tgl            ot   lat     lon  depth  mag  \
20000  2015/03/23  23:11:15.421 -2.45  139.44     21  3.1   
20001  2015/03/23  22:31:04