# Data Gathering Methods
### CSV Files
### Web Scraping
### JSON/SQL
### Fetching data from API


# Working with CSV Files

### **Comma-Separated Values (CSV) File**
In this file, the values in each row are separated by commas.
### **TSV - Tab-Separated Values File**
In this file, the values are separated by tabs.

<p>The CSV file format is a popular format supported by many machine learning frameworks. The name is variously expanded as "comma-separated values" or "character-separated values."</p>
<p>A CSV file stores tabular data (numbers and text) in plain text form. A CSV file consists of any number of records, separated by line breaks of some kind. Each record consists of fields, separated by a literal comma. In some regions, the separator might be a semi-colon.</p>
<p>Typically, all records have an identical number of fields, and missing values are represented as nulls or empty strings. There are a number of ways to load a CSV file in Python.</p>
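The separator variants described above can all be read with `pandas.read_csv()` by setting `sep`. The sketch below uses small inline strings (made-up sample data, not a file from this notebook) in place of files on disk.

```python
import io
import pandas as pd

# Hypothetical inline samples standing in for files on disk.
tsv_text = "name\tscore\nasha\t7.5\nravi\t6.1\n"
semicolon_text = "name;score\nasha;7,5\nravi;6,1\n"

# sep="\t" reads tab-separated values.
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# sep=";" together with decimal="," handles the semicolon / decimal-comma convention.
semi_df = pd.read_csv(io.StringIO(semicolon_text), sep=";", decimal=",")

print(tsv_df)
print(semi_df.dtypes)
```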

### <b> 1.Opening a local CSV File</b>

In [3]:
import pandas as pd
pd.read_csv("placement.csv")


Unnamed: 0.1,Unnamed: 0,cgpa,iq,placement
0,0,6.8,123.0,1
1,1,5.9,106.0,0
2,2,5.3,121.0,0
3,3,7.4,132.0,1
4,4,5.8,142.0,0
...,...,...,...,...
95,95,4.3,200.0,0
96,96,4.4,42.0,0
97,97,6.7,182.0,1
98,98,6.3,103.0,1


### 2.Opening a csv file from a URL

In [5]:
import requests
import csv
from io import StringIO

url = 'https://www.w3schools.com/python/pandas/data.csv.txt'
response = requests.get(url)
csv_content = StringIO(response.text)
reader = csv.reader(csv_content)

for row in reader:
    print(row)

['Duration', 'Pulse', 'Maxpulse', 'Calories']
['60', '110', '130', '409.1']
['60', '117', '145', '479.0']
['60', '103', '135', '340.0']
['45', '109', '175', '282.4']
['45', '117', '148', '406.0']
['60', '102', '127', '300.0']
['60', '110', '136', '374.0']
['45', '104', '134', '253.3']
['30', '109', '133', '195.1']
['60', '98', '124', '269.0']
['60', '103', '147', '329.3']
['60', '100', '120', '250.7']
['60', '106', '128', '345.3']
['60', '104', '132', '379.3']
['60', '98', '123', '275.0']
['60', '98', '120', '215.2']
['60', '100', '120', '300.0']
['45', '90', '112', '']
['60', '103', '123', '323.0']
['45', '97', '125', '243.0']
['60', '108', '131', '364.2']
['45', '100', '119', '282.0']
['60', '130', '101', '300.0']
['45', '105', '132', '246.0']
['60', '102', '126', '334.5']
['60', '100', '120', '250.0']
['60', '92', '118', '241.0']
['60', '103', '132', '']
['60', '100', '132', '280.0']
['60', '102', '129', '380.3']
['60', '92', '115', '243.0']
['45', '90', '112', '180.1']
['60', '101'

In [9]:
df = pd.read_csv(url)  # By default, the first row is used as the column header
print(df.head())

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0


### 3. skiprows parameter

In [13]:
df = pd.read_csv(url, skiprows=[3, 4])  # Skips file lines 3 and 4 (0-based; line 0 is the header)
df

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,45,117,148,406.0
3,60,102,127,300.0
4,60,110,136,374.0
...,...,...,...,...
162,60,105,140,290.8
163,60,110,145,300.0
164,60,115,145,310.2
165,75,120,150,320.4
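`skiprows` also accepts a callable that is given each 0-based line index and returns True for lines to skip. A minimal sketch on made-up inline data, comparing the list form with the callable form:

```python
import io
import pandas as pd

# Hypothetical sample; file line 0 is the header.
text = "a,b\n1,10\n2,20\n3,30\n4,40\n5,50\n"

# A list skips exactly those 0-based file lines (here the lines "2,20" and "3,30").
by_list = pd.read_csv(io.StringIO(text), skiprows=[2, 3])

# A callable receives each line index and returns True to skip it;
# this one keeps the header (line 0) and drops every odd-numbered line.
by_func = pd.read_csv(io.StringIO(text), skiprows=lambda i: i % 2 == 1)

print(by_list)
print(by_func)
```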


### 4. nrows parameter
<p>The nrows parameter in pandas.read_csv() limits the number of data rows read from a CSV file.</p>
**Why use nrows?**
- To preview part of a large file.
- To reduce memory usage when testing or working with a sample.
- To debug your data loading logic.

In [15]:
df = pd.read_csv(url, nrows=50)  # Reads only the first 50 data rows
df

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
5,60,102,127,300.0
6,60,110,136,374.0
7,45,104,134,253.3
8,30,109,133,195.1
9,60,98,124,269.0


### 5. Combine nrows with skiprows to read a specific slice


In [19]:
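# skiprows=10 skips the first 10 lines of the file, header included,
# so line 11 is used as the header; the next 5 rows are then read as data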
df1 = pd.read_csv(url, skiprows=10, nrows=5)
df1

Unnamed: 0,60,98,124,269.0
0,60,103,147,329.3
1,60,100,120,250.7
2,60,106,128,345.3
3,60,104,132,379.3
4,60,98,123,275.0
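With an integer `skiprows`, the header line is skipped as well, which is why the output above shows data values as column names. One way to keep the original header is to pass a range that starts at line 1. The sketch below uses generated inline data rather than the URL above.

```python
import io
import pandas as pd

# Hypothetical sample: a header plus 20 data rows.
text = "n,sq\n" + "".join(f"{i},{i * i}\n" for i in range(1, 21))

# range(1, 11) skips file lines 1..10 (the first 10 data rows)
# but keeps line 0, so the real header is preserved.
window = pd.read_csv(io.StringIO(text), skiprows=range(1, 11), nrows=5)

print(window)
```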


### 6. names parameter
**names=... tells pandas to use these column names.**
**header=None tells pandas not to treat the first row as column headers (treat it as data).**



In [23]:
import pandas as pd
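# Illustrative names only; they do not describe what placement.csv actually holds.
custom_headers = ['Name', 'Age', 'Email']
# skiprows=1 drops the file's own header line. Only three names are given here,
# and the file appears to have four columns, so pandas uses the leftover first column as the index.
df = pd.read_csv("placement.csv", names=custom_headers, header=None, skiprows=1)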
print(df.head())


   Name    Age  Email
0   6.8  123.0      1
1   5.9  106.0      0
2   5.3  121.0      0
3   7.4  132.0      1
4   5.8  142.0      0
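An alternative to `header=None` plus `skiprows=1` is `header=0`, which tells pandas that the first line is a header and that `names` should replace it. A small sketch on made-up inline data, showing that both routes give the same result:

```python
import io
import pandas as pd

# Hypothetical sample with an unhelpful header row.
with_header = "c1,c2,c3\n6.8,123,1\n5.9,106,0\n"
# The same data with no header row at all.
without_header = "6.8,123,1\n5.9,106,0\n"

new_names = ["cgpa", "iq", "placement"]

# header=0 marks the first line as a header to be replaced by `names`.
renamed = pd.read_csv(io.StringIO(with_header), names=new_names, header=0)

# header=None says every line is data; `names` supplies the labels.
labelled = pd.read_csv(io.StringIO(without_header), names=new_names, header=None)

print(renamed)
print(labelled)
```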


### 7. Index_col Parameter
<p>The index_col parameter in pandas.read_csv() is used to specify which column(s) should be used as the row index of the DataFrame. </p>
<p> When to use:
When your CSV has a unique identifier (like IDs) in the first column
When you want to simplify row referencing using meaningful labels</p>

In [3]:
import pandas as pd

df = pd.read_csv("placement.csv", index_col=0)
print(df.head())


   cgpa     iq  placement
0   6.8  123.0          1
1   5.9  106.0          0
2   5.3  121.0          0
3   7.4  132.0          1
4   5.8  142.0          0


In [5]:
df = pd.read_csv("placement.csv", index_col='cgpa')
df

Unnamed: 0_level_0,Unnamed: 0,iq,placement
cgpa,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
6.8,0,123.0,1
5.9,1,106.0,0
5.3,2,121.0,0
7.4,3,132.0,1
5.8,4,142.0,0
...,...,...,...
4.3,95,200.0,0
4.4,96,42.0,0
6.7,97,182.0,1
6.3,98,103.0,1


In [7]:
df = pd.read_csv("placement.csv", index_col=['cgpa', 'iq',"placement"])
df

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 0
cgpa,iq,placement,Unnamed: 3_level_1
6.8,123.0,1,0
5.9,106.0,0,1
5.3,121.0,0,2
7.4,132.0,1,3
5.8,142.0,0,4
...,...,...,...
4.3,200.0,0,95
4.4,42.0,0,96
6.7,182.0,1,97
6.3,103.0,1,98


### 8. usecols parameter
<p> Why use usecols?
To load only relevant columns and reduce memory usage
To speed up reading large CSV files</p>
<p>You can pass:
A list of column names, or
A list of column indices:</p>

In [8]:
import pandas as pd
df = pd.read_csv("placement.csv", usecols=['cgpa', 'iq'])
print(df.head())


   cgpa     iq
0   6.8  123.0
1   5.9  106.0
2   5.3  121.0
3   7.4  132.0
4   5.8  142.0


In [9]:
df = pd.read_csv("placement.csv", usecols=[0, 2])  # Reads the first and third column
df

Unnamed: 0.1,Unnamed: 0,iq
0,0,123.0
1,1,106.0
2,2,121.0
3,3,132.0
4,4,142.0
...,...,...
95,95,200.0
96,96,42.0
97,97,182.0
98,98,103.0
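`usecols` also accepts a callable that is evaluated against each column name, which is handy for dropping a column rather than listing all the others. A minimal sketch with made-up inline data (the `id` column is hypothetical):

```python
import io
import pandas as pd

# Hypothetical sample with an id column we do not want.
text = "id,cgpa,iq,placement\n0,6.8,123,1\n1,5.9,106,0\n"

# The callable returns True for the columns to keep.
subset = pd.read_csv(io.StringIO(text), usecols=lambda name: name != "id")

print(subset)
```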


### 9.Squeeze Parameter
The squeeze parameter in pandas.read_csv() was used to return a Series instead of a DataFrame when the parsed data contained only one column. It was deprecated in pandas 1.4.0 and removed in pandas 2.0, so passing it in current versions raises a TypeError.

In [10]:
df = pd.read_csv("placement.csv", usecols=['cgpa'],squeeze=True)
# Returns a Series instead of a DataFrame if only one column is read
df

TypeError: read_csv() got an unexpected keyword argument 'squeeze'

In [11]:
# Modern way
import pandas as pd

df = pd.read_csv("placement.csv", usecols=['cgpa'])
cgpa_series = df['cgpa']  # Convert single-column DataFrame to Series

print(type(cgpa_series))  # <class 'pandas.core.series.Series'>
print(cgpa_series.head())


<class 'pandas.core.series.Series'>
0    6.8
1    5.9
2    5.3
3    7.4
4    5.8
Name: cgpa, dtype: float64
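Another replacement for the removed argument is to call `DataFrame.squeeze("columns")` on the result of `read_csv()`. A short sketch on made-up inline data:

```python
import io
import pandas as pd

# Hypothetical single-column sample.
text = "cgpa\n6.8\n5.9\n5.3\n"

# squeeze("columns") turns a one-column DataFrame into a Series.
cgpa = pd.read_csv(io.StringIO(text)).squeeze("columns")

print(type(cgpa))
print(cgpa)
```

If the DataFrame has more than one column, `squeeze("columns")` leaves it as a DataFrame.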


### 10. Skip bad lines parameter (on_bad_lines)
To skip bad lines when reading a CSV file with pandas, use the on_bad_lines parameter (the older error_bad_lines was deprecated in pandas 1.3 and later removed). A bad line is one with too many fields; lines with too few fields are padded with NaN rather than treated as bad.

on_bad_lines options:

| **Value** | **Behavior** |
|---|---|
| 'skip' | Skips bad lines silently |
| 'warn' | Skips bad lines and shows a warning |
| 'error' (default) | Raises an error if any bad line is encountered |
| Custom function | Called with each bad line as a list of strings; requires engine='python' |

In [12]:
import pandas as pd

df = pd.read_csv("placement.csv", on_bad_lines='skip')
print(df.head())


   Unnamed: 0  cgpa     iq  placement
0           0   6.8  123.0          1
1           1   5.9  106.0          0
2           2   5.3  121.0          0
3           3   7.4  132.0          1
4           4   5.8  142.0          0


In [14]:
import pandas as pd

def handle_bad_line(line):
    print("Bad line found:", line)
    return None  # Skips the bad line

df = pd.read_csv("placement.csv", on_bad_lines=handle_bad_line, engine='python')
print(df.head())



   Unnamed: 0  cgpa     iq  placement
0           0   6.8  123.0          1
1           1   5.9  106.0          0
2           2   5.3  121.0          0
3           3   7.4  132.0          1
4           4   5.8  142.0          0
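placement.csv has no malformed rows, so the cells above never trigger the handler. The sketch below uses made-up inline data with one over-long line to show both `'skip'` and a callable that repairs the line by truncating it. The callable form assumes pandas 1.4 or later.

```python
import io
import pandas as pd

# Hypothetical sample: "4,5,6,7" has one field too many, "8,9" has one too few.
text = "a,b,c\n1,2,3\n4,5,6,7\n8,9\n10,11,12\n"

# 'skip' drops the over-long line; the short line is kept and padded with NaN.
skipped = pd.read_csv(io.StringIO(text), on_bad_lines="skip")

seen = []

def keep_first_three(bad_line):
    """Record the bad line, then keep only its first three fields."""
    seen.append(bad_line)
    return bad_line[:3]

# A callable is only supported with engine="python".
repaired = pd.read_csv(io.StringIO(text), on_bad_lines=keep_first_three, engine="python")

print(skipped)
print(repaired)
```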


### 11. dtype parameter
The **dtype** parameter in pandas.read_csv() explicitly sets the data type for one or more columns while reading a CSV file. (**dtypes**, by contrast, is the DataFrame attribute that reports the resulting types.)

Why use dtype?
- Prevents pandas from inferring wrong types
- Can speed up parsing, since pandas does not have to infer the types
- Avoids type conversion errors later

In [17]:
import pandas as pd

df = pd.read_csv(
    "placement.csv",
    dtype={
        'cgpa': float,
        'iq': int,
        'placement': str  # Use str (built-in type)
    }
)
print(df.dtypes)



Unnamed: 0      int64
cgpa          float64
iq              int64
placement      object
dtype: object
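A common case where inference goes wrong is an identifier with leading zeros, and a common case where a plain `int` fails is an integer column with missing values. The sketch below uses made-up inline data; the column names `zip` and `count` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample: zip codes with leading zeros, and a count with a gap.
text = "zip,count\n01234,5\n00501,\n"

# Without dtype, zip is inferred as an integer and the leading zeros are lost.
inferred = pd.read_csv(io.StringIO(text))

# str keeps the zip codes intact; the nullable "Int64" dtype allows missing integers.
typed = pd.read_csv(io.StringIO(text), dtype={"zip": str, "count": "Int64"})

print(inferred)
print(typed)
print(typed.dtypes)
```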


### 12. na_values parameter
<p> The na_values parameter in pandas.read_csv() is used to specify additional strings that should be recognized as missing values (NaN) in the dataset.</p>
<p> Why use na_values?
Your data might have non-standard missing indicators (e.g., "-", "?", "null")
Helps avoid manual cleaning after loading the data.</p>

In [1]:
# Example: Replace "NA", "n/a", and "-" with NaN
import pandas as pd
df= pd.read_csv("placement.csv", na_values=["NA", "n/a", "-"])
print(df.head())


   Unnamed: 0  cgpa     iq  placement
0           0   6.8  123.0          1
1           1   5.9  106.0          0
2           2   5.3  121.0          0
3           3   7.4  132.0          1
4           4   5.8  142.0          0


In [2]:
# Per-column form: pass a dictionary mapping column names to their NA markers.
# The keys should be columns that exist in the file being read.
na_dict = {
    "cgpa": ["NA", "-1"],
    "iq": ["missing", "none"]
}
df = pd.read_csv("placement.csv", na_values=na_dict)
df

Unnamed: 0.1,Unnamed: 0,cgpa,iq,placement
0,0,6.8,123.0,1
1,1,5.9,106.0,0
2,2,5.3,121.0,0
3,3,7.4,132.0,1
4,4,5.8,142.0,0
...,...,...,...,...
95,95,4.3,200.0,0
96,96,4.4,42.0,0
97,97,6.7,182.0,1
98,98,6.3,103.0,1
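placement.csv contains none of these markers, so the outputs above look unchanged. A minimal sketch on made-up inline data, where the effect of the list form and the dictionary form is visible:

```python
import io
import pandas as pd

# Hypothetical sample using "-" and "?" as missing-value markers.
text = "cgpa,iq\n6.8,123\n-,106\n5.3,?\n"

# List form: the markers apply to every column.
all_cols = pd.read_csv(io.StringIO(text), na_values=["-", "?"])

# Dictionary form: "?" is treated as missing only in the iq column,
# so the "-" in cgpa stays a literal string.
per_col = pd.read_csv(io.StringIO(text), na_values={"iq": ["?"]})

print(all_cols)
print(per_col)
```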


### 13. Encoding Parameter
The **encoding** parameter in pandas.read_csv() tells pandas how to decode the bytes in your CSV file into characters. This matters most for files that aren’t UTF-8 encoded.

| Encoding | Use case |
|---|---|
| 'utf-8' | Standard encoding (default) |
| 'utf-8-sig' | UTF-8 with BOM (often needed for Excel-exported files) |
| 'latin1' | Western European (ISO-8859-1); never raises a decode error, but can silently produce wrong characters if the file uses another encoding |
| 'cp1252' | Windows encoding (similar to latin1) |

In [19]:
import pandas as pd

df = pd.read_csv("placement.csv", encoding='utf-8')  # Default for most files
df

Unnamed: 0.1,Unnamed: 0,cgpa,iq,placement
0,0,6.8,123.0,1
1,1,5.9,106.0,0
2,2,5.3,121.0,0
3,3,7.4,132.0,1
4,4,5.8,142.0,0
...,...,...,...,...
95,95,4.3,200.0,0
96,96,4.4,42.0,0
97,97,6.7,182.0,1
98,98,6.3,103.0,1


**If you get a UnicodeDecodeError, try:**

In [20]:
df = pd.read_csv("placement.csv", encoding='latin1')
df

Unnamed: 0.1,Unnamed: 0,cgpa,iq,placement
0,0,6.8,123.0,1
1,1,5.9,106.0,0
2,2,5.3,121.0,0
3,3,7.4,132.0,1
4,4,5.8,142.0,0
...,...,...,...,...
95,95,4.3,200.0,0
96,96,4.4,42.0,0
97,97,6.7,182.0,1
98,98,6.3,103.0,1


In [22]:
df = pd.read_csv("placement.csv", encoding='utf-8', encoding_errors='replace')
df.head()

Unnamed: 0.1,Unnamed: 0,cgpa,iq,placement
0,0,6.8,123.0,1
1,1,5.9,106.0,0
2,2,5.3,121.0,0
3,3,7.4,132.0,1
4,4,5.8,142.0,0
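placement.csv decodes cleanly under every encoding shown, so the cells above all produce the same output. The sketch below builds a small Latin-1 byte buffer (made-up data) to show the decode error and the two workarounds.

```python
import io
import pandas as pd

# Hypothetical sample: accented text written out as Latin-1 bytes.
raw = "name,city\nJosé,Málaga\n".encode("latin1")

# Reading Latin-1 bytes as UTF-8 fails on the accented characters.
try:
    pd.read_csv(io.BytesIO(raw), encoding="utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# The matching encoding decodes the text correctly.
correct = pd.read_csv(io.BytesIO(raw), encoding="latin1")

# encoding_errors="replace" avoids the error but substitutes U+FFFD for bad bytes.
replaced = pd.read_csv(io.BytesIO(raw), encoding="utf-8", encoding_errors="replace")

print(decoded_as_utf8)
print(correct)
print(replaced)
```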


### 14.Handling Dates
- Use parse_dates for automatic date parsing.
- Custom date format: use date_format with a strptime-style format string (pandas 2.0+); the older date_parser argument is deprecated.
- Handle inconsistent dates by setting errors='coerce' in pd.to_datetime().
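A minimal sketch of two of these options, using made-up inline data; the column names `admitted` and `discharged` are hypothetical and are not taken from the hospital dataset loaded below.

```python
import io
import pandas as pd

# Hypothetical data: one clean date column and one with a bad entry.
text = (
    "admitted,discharged\n"
    "2024-01-05,2024-01-09\n"
    "2024-02-10,unknown\n"
)

# parse_dates converts the listed columns while reading.
df = pd.read_csv(io.StringIO(text), parse_dates=["admitted"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
df["discharged"] = pd.to_datetime(df["discharged"], format="%Y-%m-%d", errors="coerce")

print(df)
print(df.dtypes)
```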

In [28]:
import pandas as pd
# Assign the result; without the assignment, df would still hold the placement data.
df = pd.read_csv("Drug_Use_Data_from_Selected_Hospitals.csv")
df.head()


### 15.Converters
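The `converters` parameter of pandas.read_csv() takes a dictionary that maps a column to a function; the function is applied to each raw string value in that column as it is read. A minimal sketch on made-up inline data, assuming a `name` column that needs its whitespace and capitalisation cleaned:

```python
import io
import pandas as pd

# Hypothetical sample with untidy names.
text = "name,cgpa\n  asha ,6.8\nRAVI,5.9\n"

# Each converter receives the raw cell text as a string.
df = pd.read_csv(
    io.StringIO(text),
    converters={"name": lambda s: s.strip().title()},
)

print(df)
```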

### 16.Loading a huge dataset in Chunks
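Passing `chunksize` to pandas.read_csv() returns an iterator that yields DataFrames of at most that many rows, so a file larger than memory can be processed piece by piece. The sketch below uses a tiny generated sample so it stays runnable; with a real file, the path and a much larger chunksize would be used instead.

```python
import io
import pandas as pd

# Hypothetical sample: a header plus 10 data rows.
text = "n\n" + "".join(f"{i}\n" for i in range(1, 11))

total = 0
chunk_sizes = []

# Each iteration yields a DataFrame holding up to 4 rows.
with pd.read_csv(io.StringIO(text), chunksize=4) as reader:
    for chunk in reader:
        chunk_sizes.append(len(chunk))
        total += int(chunk["n"].sum())

print(chunk_sizes)
print(total)
```

Only a running aggregate is kept here; each chunk can be discarded once it has been processed.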