# Import Pandas


In [1]:
import pandas as pd

# 1. Opening a Local CSV File
To open a local CSV file using Pandas, you can use the `pd.read_csv()` function. Here's an example:
```python
# Read a CSV file
df = pd.read_csv('your_file.csv')
```

In [None]:
df = pd.read_csv('aug_train.csv')
df

# 2. Opening a CSV File from a URL
You can also open a CSV file directly from a URL using the same `pd.read_csv()` function. Here's how you can do it:
```python
# Read a CSV file from a URL
df = pd.read_csv('https://example.com/your_file.csv')
```

In [None]:
import requests
from io import StringIO

url = 'https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv' # Replace with actual URL
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}
response = requests.get(url, headers=headers)
data = StringIO(response.text)
df = pd.read_csv(data)
df

# Sep Parameter
If your CSV file uses a different delimiter (not a comma), you can specify it using the `sep` parameter in the `pd.read_csv()` function. For example, if your file uses a semicolon (`;`) as a delimiter, you can do the following:
```python
# Read a CSV file with a semicolon delimiter
df = pd.read_csv('your_file.csv', sep=';')
```

What is a delimiter?
A delimiter is a character or sequence of characters that separates data fields in a text file or data stream. In the context of CSV (Comma-Separated Values) files, a delimiter is used to indicate where one data field ends and the next one begins. The most common delimiter is a comma (`,`), but other characters such as semicolons (`;`), tabs (`\t`), or pipes (`|`) can also be used depending on the format of the data.
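
To make the idea concrete, here is a minimal sketch that parses the same small table written with three different delimiters. The sample rows and column names are made up for illustration, and `io.StringIO` stands in for a file on disk.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data: the same table written with three delimiters.
samples = {
    ';': "name;age;city\nAsha;31;Pune\nRavi;27;Delhi\n",
    '\t': "name\tage\tcity\nAsha\t31\tPune\nRavi\t27\tDelhi\n",
    '|': "name|age|city\nAsha|31|Pune\nRavi|27|Delhi\n",
}

frames = {}
for delimiter, text in samples.items():
    # StringIO lets read_csv treat an in-memory string like a file.
    frames[delimiter] = pd.read_csv(StringIO(text), sep=delimiter)

# Each variant parses to the same DataFrame with 2 rows and 3 columns.
print(frames[';'])
```

Only the `sep` argument changes between the three calls; the parsed result is identical.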

In [None]:
# A TSV (tab-separated values) file uses a tab as the delimiter, so sep=';' is replaced with sep='\t'.
# Issue with this file: it has no header row, so by default the 1st data row becomes the header.
# Solution: pass ``header=None``, or set the column names ourselves using ``names=['col1', 'col2', ...]``.
pd.read_csv('movie_titles_metadata.tsv', sep='\t', names=['s_no', 'title', 'release_year', 'rating', 'voting', 'genre'])

# Index Column Parameter
When reading a CSV file, you can specify which column to use as the index of the DataFrame using the `index_col` parameter in the `pd.read_csv()` function. For example, if you want to set the first column (column 0) as the index, you can do the following:
```python
# Read a CSV file and set the first column as the index
df = pd.read_csv('your_file.csv', index_col=0)
```
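
As a small self-contained sketch, the index column can be chosen either by position or by name. The data and the `emp_id` column name below are hypothetical and only serve the illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical data; 'emp_id' is an illustrative column name.
csv_text = "emp_id,name,dept\n101,Asha,HR\n102,Ravi,IT\n"

# Select the index column by position...
by_position = pd.read_csv(StringIO(csv_text), index_col=0)
# ...or by name. Both calls produce the same DataFrame here.
by_name = pd.read_csv(StringIO(csv_text), index_col='emp_id')

# With a meaningful index, rows can be looked up by label.
print(by_name.loc[101, 'name'])
```

Once a column becomes the index, it no longer appears among the regular columns.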

In [43]:
pd.read_csv('aug_train.csv')
# making enrollee_id as index column
pd.read_csv('aug_train.csv', index_col='enrollee_id')


# making 1st column as index column
#pd.read_csv('aug_train.csv', index_col=0)

enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
8949,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7386,city_173,0.878,Male,No relevent experience,no_enrollment,Graduate,Humanities,14,,,1,42,1.0
31398,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1.0
24576,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
5756,city_65,0.802,Male,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0.0


# Header Parameter
When reading a CSV file, if the first row of the file does not contain the column headers, you can specify this using the `header` parameter in the `pd.read_csv()` function. You can set `header=None` to indicate that there are no headers in the file. For example:
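```python
# Read a CSV file without headers: pandas assigns integer column labels 0, 1, 2, ...
df = pd.read_csv('your_file.csv', header=None)

# Default behaviour: the first row (row 0) is the header
df = pd.read_csv('your_file.csv', header=0)

# The second row (row 1) is the header; the row above it is discarded
df = pd.read_csv('your_file.csv', header=1)
```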
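
The effect of these settings is easiest to see on a tiny headerless file. The sketch below uses made-up rows and hypothetical column names (`id`, `name`, `dept`).

```python
from io import StringIO

import pandas as pd

# Hypothetical headerless data: every line is a data row.
csv_text = "1,Asha,HR\n2,Ravi,IT\n"

# Default (header=0): the first data row is consumed as column names.
default = pd.read_csv(StringIO(csv_text))

# header=None: all rows are kept and columns are labelled 0, 1, 2.
no_header = pd.read_csv(StringIO(csv_text), header=None)

# header=None plus names: all rows are kept with readable labels.
named = pd.read_csv(StringIO(csv_text), header=None, names=['id', 'name', 'dept'])

print(default.shape, no_header.shape, named.shape)
```

With the default setting one data row is silently lost to the header, which is why headerless files need `header=None` or `names`.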

In [40]:
pd.read_csv('test.csv')
# No headers in this file
# solution : header=None
pd.read_csv('test.csv', header=None)
# or choose the header row ourselves: ``header=0`` (default) or ``header=1`` (if the 2nd row is the header)
pd.read_csv('test.csv', header=1)

Unnamed: 0,0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0
1,2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0
2,3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1
3,4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0


# Use Cols Parameter
If you want to read only specific columns from a CSV file, you can use the `usecols` parameter in the `pd.read_csv()` function. For example, if you want to read only the columns 'A' and 'C', you can do the following:
```python
# Read specific columns from a CSV file
df = pd.read_csv('your_file.csv', usecols=['A', 'C'])
```
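
Besides a list of names, `usecols` also accepts column positions or a function that is called with each column name. A minimal sketch with made-up data:

```python
from io import StringIO

import pandas as pd

# Hypothetical three-column data.
csv_text = "A,B,C\n1,2,3\n4,5,6\n"

# Select columns by name.
by_name = pd.read_csv(StringIO(csv_text), usecols=['A', 'C'])
# Select columns by zero-based position.
by_position = pd.read_csv(StringIO(csv_text), usecols=[0, 2])
# Select columns with a function: keep every column except 'B'.
by_rule = pd.read_csv(StringIO(csv_text), usecols=lambda col: col != 'B')

print(by_name)
```

All three calls load only columns `A` and `C`; column `B` is never materialised, which can save memory on wide files.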

In [None]:
pd.read_csv('aug_train.csv', usecols=['city', 'gender'])

# Skiprows Parameter
If you want to skip a certain number of rows at the beginning of the CSV file while reading it, you can use the `skiprows` parameter in the `pd.read_csv()` function. For example, if you want to skip the first 3 rows of the file, you can do the following:
```python
# Read a CSV file and skip the first 3 rows
df = pd.read_csv('your_file.csv', skiprows=3)
```

# nrows Parameter
If you want to read only a specific number of rows from the CSV file, you can use the `nrows` parameter in the `pd.read_csv()` function. For example, if you want to read only the first 10 rows of the file, you can do the following:
```python
# Read a CSV file and limit to the first 10 rows
df = pd.read_csv('your_file.csv', nrows=10)
```
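
One detail worth seeing in action: an integer `skiprows` counts lines from the very top of the file, so the header line is skipped too, whereas a list or range of line numbers can leave the header in place. The sketch below uses made-up data and also combines `skiprows` with `nrows`.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: one header line followed by five data rows.
csv_text = "id,score\n1,10\n2,20\n3,30\n4,40\n5,50\n"

# Integer: skips the first 2 lines (the header and the first data row),
# so the line "2,20" is then treated as the header.
skip_int = pd.read_csv(StringIO(csv_text), skiprows=2)

# Range of line numbers: skips lines 1 and 2, keeping the header (line 0).
skip_range = pd.read_csv(StringIO(csv_text), skiprows=range(1, 3))

# nrows limits how many data rows are read after the header.
first_two = pd.read_csv(StringIO(csv_text), nrows=2)

# Both together: skip the first two data rows, then read the next two.
window = pd.read_csv(StringIO(csv_text), skiprows=range(1, 3), nrows=2)

print(window)
```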

In [None]:
pd.read_csv('aug_train.csv', skiprows=0)
#pd.read_csv('aug_train.csv', skiprows=range(1, 5))  # Skip first 4 rows after header
#pd.read_csv('aug_train.csv', skiprows=[2, 5, 10])  # Skip specific rows

pd.read_csv('aug_train.csv', nrows=100)  # Read only the first 100 rows

# Encoding Parameter
When reading a CSV file, you may encounter files that use different character encodings. To handle this, you can specify the encoding using the `encoding` parameter in the `pd.read_csv()` function. For example, if your file is encoded in UTF-8, you can do the following:
```python
# Read a CSV file with UTF-8 encoding
df = pd.read_csv('your_file.csv', encoding='utf-8')
```
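
The sketch below shows why the encoding matters. It builds a small Latin-1 encoded file in memory (the names are made up), reads it with the matching encoding, and then checks that decoding the same bytes as UTF-8 fails.

```python
from io import BytesIO

import pandas as pd

# Hypothetical data containing accented characters, encoded as Latin-1 bytes.
raw = "name,city\nJosé,Málaga\n".encode('latin-1')

# Reading with the matching encoding decodes the accents correctly.
df = pd.read_csv(BytesIO(raw), encoding='latin-1')

# The same bytes are not valid UTF-8, so decoding them as UTF-8 fails.
try:
    pd.read_csv(BytesIO(raw), encoding='utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(df)
print("UTF-8 decoding failed:", utf8_failed)
```

A `UnicodeDecodeError` while reading a file is usually a sign that a different `encoding` value is needed.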

In [None]:
pd.read_csv('zomato.csv', encoding='latin-1')

# Skip bad lines 
If your CSV file contains malformed lines that you want to skip while reading, you can use the `on_bad_lines` parameter in the `pd.read_csv()` function. For example, to skip bad lines, you can do the following:
```python
# Read a CSV file and skip bad lines (pandas 1.3 and later)
df = pd.read_csv('your_file.csv', on_bad_lines='skip')

# Older equivalent, for pandas versions before 1.3 only:
# df = pd.read_csv('your_file.csv', error_bad_lines=False)
```

Difference between `on_bad_lines` and `error_bad_lines`:
- `on_bad_lines`: This parameter specifies how to handle bad lines in the CSV file. You can set it to 'error' (the default) to raise an error when a bad line is encountered, 'warn' to issue a warning and skip the line, or 'skip' to skip bad lines silently.
- `error_bad_lines`: This older parameter is a boolean that indicates whether to raise an error when bad lines are encountered. If set to `True`, an error is raised for bad lines; if set to `False`, bad lines are skipped without raising an error. It was deprecated in pandas 1.3 in favor of `on_bad_lines` and removed in pandas 2.0.

# dtype Parameter
When reading a CSV file, you can specify the data types for each column using the `dtype` parameter in the `pd.read_csv()` function. This is useful when you want to ensure that certain columns are read as specific data types. For example, if you want to read a column 'A' as integers and column 'B' as floats, you can do the following:
```python
# Read a CSV file with specified data types for columns
df = pd.read_csv('your_file.csv', dtype={'A': int, 'B': float})
```
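
A common reason to set `dtype` is to stop pandas from inferring a numeric type for identifiers. In the sketch below, which uses made-up data and hypothetical column names, ZIP codes lose their leading zero unless they are read as strings.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: ZIP codes that start with a zero.
csv_text = "zip_code,count\n01234,5\n90210,7\n"

# Without dtype, pandas infers integers and the leading zero is lost.
inferred = pd.read_csv(StringIO(csv_text))

# With dtype, zip_code stays text and count is read as a float.
typed = pd.read_csv(StringIO(csv_text), dtype={'zip_code': str, 'count': 'float64'})

print(inferred['zip_code'].tolist())
print(typed['zip_code'].tolist())
```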


In [None]:
#pd.read_csv('aug_train.csv').info()  # Check for any issues
# The target column holds only 0 and 1, but its dtype is float64 instead of int64, so we request int64 while reading the CSV file.
# Note: if the file stores these values as '0.0'/'1.0', pandas may raise a ValueError for this cast; in that case read the column as float and convert it afterwards with .astype('int64').
pd.read_csv('aug_train.csv', dtype={'target': 'int64'})

# Handling Dates 
When reading a CSV file that contains date columns, you can use the `parse_dates` parameter in the `pd.read_csv()` function to automatically parse those columns as datetime objects. For example, if you have a column named 'date' that contains date information, you can do the following:
```python
# Read a CSV file and parse the 'date' column as datetime
df = pd.read_csv('your_file.csv', parse_dates=['date'])
```
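
Here is a minimal sketch of the difference `parse_dates` makes, using made-up data with ISO-formatted dates. Once the column is a datetime, the `.dt` accessor becomes available.

```python
from io import StringIO

import pandas as pd

# Hypothetical data with an ISO-formatted date column.
csv_text = "date,amount\n2020-01-15,100\n2020-02-20,250\n"

# Without parse_dates the column is read as plain text.
plain = pd.read_csv(StringIO(csv_text))

# With parse_dates the column becomes a datetime column.
parsed = pd.read_csv(StringIO(csv_text), parse_dates=['date'])

# Datetime columns support the .dt accessor, e.g. extracting the month.
print(parsed['date'].dt.month.tolist())
```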

In [None]:
#pd.read_csv('IPL Matches 2008-2020.csv').info() # date columns are read as object dtype 
pd.read_csv('IPL Matches 2008-2020.csv', parse_dates=['date']).info() # date column is read as datetime64 dtype and any modifications can be done easily 


# Converters Parameter
When reading a CSV file, you can use the `converters` parameter in the `pd.read_csv()` function to apply custom conversion functions to specific columns. This is useful when you want to transform the data in a column while reading it. For example, if you have a column 'A' that contains string representations of numbers and you want to convert them to integers, you can do the following:
```python
# Read a CSV file and apply custom conversion to column 'A'
df = pd.read_csv('your_file.csv', converters={'A': int})
```
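
Converters are most useful when a column needs cleaning before it can be treated as a number. The sketch below assumes a made-up `price` column formatted as currency; `parse_price` is a hypothetical helper written for this example.

```python
from io import StringIO

import pandas as pd


def parse_price(text):
    """Turn a currency string such as '$1,200.50' into a float."""
    return float(text.replace('$', '').replace(',', ''))


# Hypothetical data: prices stored as currency strings.
csv_text = 'item,price\nlaptop,"$1,200.50"\nmouse,$25.00\n'

# The converter receives each raw cell of the column as a string.
df = pd.read_csv(StringIO(csv_text), converters={'price': parse_price})

print(df['price'].tolist())
```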

In [108]:
# Creating a custom converter function which renames some team names :
def rename(name):
    if name == "Royal Challengers Bangalore":
        return "RCB"
    elif name == "Chennai Super Kings":
        return "CSK"
    else:
        return name
    
# function call 
rename("Royal Challengers Bangalore")  # Output: "RCB"
rename("Chennai Super Kings")        # Output: "CSK"


pd.read_csv('IPL Matches 2008-2020.csv', converters={'team1': rename, 'team2': rename}).head()

Unnamed: 0,id,city,date,player_of_match,venue,neutral_venue,team1,team2,toss_winner,toss_decision,winner,result,result_margin,eliminator,method,umpire1,umpire2
0,335982,Bangalore,2008-04-18,BB McCullum,M Chinnaswamy Stadium,0,RCB,Kolkata Knight Riders,Royal Challengers Bangalore,field,Kolkata Knight Riders,runs,140.0,N,,Asad Rauf,RE Koertzen
1,335983,Chandigarh,2008-04-19,MEK Hussey,"Punjab Cricket Association Stadium, Mohali",0,Kings XI Punjab,CSK,Chennai Super Kings,bat,Chennai Super Kings,runs,33.0,N,,MR Benson,SL Shastri
2,335984,Delhi,2008-04-19,MF Maharoof,Feroz Shah Kotla,0,Delhi Daredevils,Rajasthan Royals,Rajasthan Royals,bat,Delhi Daredevils,wickets,9.0,N,,Aleem Dar,GA Pratapkumar
3,335985,Mumbai,2008-04-20,MV Boucher,Wankhede Stadium,0,Mumbai Indians,RCB,Mumbai Indians,bat,Royal Challengers Bangalore,wickets,5.0,N,,SJ Davis,DJ Harper
4,335986,Kolkata,2008-04-20,DJ Hussey,Eden Gardens,0,Kolkata Knight Riders,Deccan Chargers,Deccan Chargers,bat,Kolkata Knight Riders,wickets,5.0,N,,BF Bowden,K Hariharan


# na Values Parameter
When reading a CSV file, you can specify additional strings to recognize as NA/NaN using the `na_values` parameter in the `pd.read_csv()` function. This is useful when your dataset uses specific placeholders for missing values. For example, if your file uses 'NA' or 'missing' or '-' to represent missing values, you can do the following:
```python
# Read a CSV file and specify additional NA values
df = pd.read_csv('your_file.csv', na_values=['NA', 'missing', '-'])
```
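
A list passed to `na_values` applies to every column. To limit a placeholder to one column, a dictionary keyed by column name can be passed instead. A minimal sketch with made-up data:

```python
from io import StringIO

import pandas as pd

# Hypothetical data using '-' and 'missing' as placeholders.
csv_text = "name,score,grade\nAsha,-,A\nRavi,88,missing\nMeera,NA,B\n"

# List form: the placeholders are treated as NaN in every column.
df = pd.read_csv(StringIO(csv_text), na_values=['missing', '-'])

# Dict form: '-' is treated as NaN only in the 'score' column,
# so 'missing' in the 'grade' column stays an ordinary string.
per_column = pd.read_csv(StringIO(csv_text), na_values={'score': ['-']})

print(df.isna().sum())
```

In both calls the default markers such as `NA` are still recognised, because `keep_default_na` is `True` unless changed.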

In [None]:
pd.read_csv('aug_train.csv', na_values=['Male', 'NA', 'missing', '-'])
# Here 'Male' (which appears in the gender column) is treated as NaN, in addition to the default NaN markers. The list applies to every column.

# Loading a Huge Dataset in Chunks
When dealing with very large CSV files that may not fit into memory, you can read the file in smaller chunks using the `chunksize` parameter in the `pd.read_csv()` function. This allows you to process the data in manageable pieces. For example, to read a CSV file in chunks of 1000 rows, you can do the following:
```python
# Read a CSV file in chunks of 1000 rows
df_iterator = pd.read_csv('your_file.csv', chunksize=1000)
for chunk in df_iterator:
    # Process each chunk
    print(chunk.head())
```
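
In practice, chunked reading is often paired with a running aggregate, so that only one chunk is in memory at a time. The sketch below assumes a made-up ten-row file and a chunk size of 4, and sums one column across all chunks.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: ten rows with values 0 to 9.
rows = "\n".join(f"{i},{i % 3}" for i in range(10))
csv_text = "value,group\n" + rows + "\n"

total = 0
chunk_sizes = []
for chunk in pd.read_csv(StringIO(csv_text), chunksize=4):
    # Each chunk is an ordinary DataFrame holding at most 4 rows.
    chunk_sizes.append(len(chunk))
    total += chunk['value'].sum()

print(chunk_sizes)
print(total)
```

The last chunk is smaller than the rest whenever the row count is not a multiple of `chunksize`.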

In [122]:
pd.read_csv('aug_train.csv').shape # (19158, 14)
df_iterator = pd.read_csv('aug_train.csv', chunksize=5000) # returns an iterator
for chunk in df_iterator:
    print(chunk.shape)  # Process each chunk

# Output:
# (5000, 14)
# (5000, 14)
# (5000, 14)
# (4158, 14)

(5000, 14)
(5000, 14)
(5000, 14)
(4158, 14)


In [123]:
# You can also concatenate all chunks into a single DataFrame after processing
df_iterator = pd.read_csv('aug_train.csv', chunksize=5000)
df = pd.concat(chunk for chunk in df_iterator)
df.shape  # (19158, 14)

(19158, 14)