# Formats/sources of data in general

1) CSV files
2) JSON/SQL formats
3) Fetching data from an API
4) Web scraping

#### CSV files

The CSV file format is a popular format supported by many machine learning frameworks. The name is variously expanded as "comma-separated values" or "character-separated values."

A CSV file stores tabular data (numbers and text) in plain text form. A CSV file consists of any number of records, separated by line breaks of some kind. Each record consists of fields, separated by a literal comma. In regions where the comma is used as the decimal separator, the field separator is often a semicolon instead.
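
To make the record/field structure concrete, here is a minimal sketch using Python's standard-library `csv` module. The sample text and the helper name `parse_records` are made up for illustration; the second call assumes a semicolon-separated file of the kind mentioned above.

```python
import csv
from io import StringIO


def parse_records(text, delimiter=","):
    """Split CSV text into records (rows), each a list of string fields."""
    return list(csv.reader(StringIO(text), delimiter=delimiter))


# Made-up sample data: one header record followed by two data records.
comma_text = "name,score\nAda,9.5\nLinus,8.0\n"
# Same data with ';' as the field separator and ',' as the decimal mark.
semicolon_text = "name;score\nAda;9,5\nLinus;8,0\n"

comma_rows = parse_records(comma_text)
semicolon_rows = parse_records(semicolon_text, delimiter=";")

print(comma_rows)
print(semicolon_rows)
```

Note that the `csv` module returns every field as a string; converting `"9.5"` or `"9,5"` to a number is a separate step.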

Typically, all records have an identical number of fields, and missing values are represented as nulls or empty strings. There are a number of ways to load a CSV file in Python.
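
As a rough illustration of these two properties, the sketch below uses the standard-library `csv` module on made-up data. It checks that every record has as many fields as the header and shows that a missing value arrives as an empty string. The names `sample`, `same_width`, and `missing_fields` are hypothetical.

```python
import csv
from io import StringIO

# Made-up sample: the second record has no value for "gender".
sample = "id,gender,hours\n1,Male,36\n2,,83\n"

rows = list(csv.reader(StringIO(sample)))
header, records = rows[0], rows[1:]

# Every record should have as many fields as the header.
same_width = all(len(record) == len(header) for record in records)

# DictReader maps each record to {column name: field}; empty fields stay "".
dict_rows = list(csv.DictReader(StringIO(sample)))
missing_fields = [
    (row["id"], column)
    for row in dict_rows
    for column, value in row.items()
    if value == ""
]

print(same_width)
print(missing_fields)
```

`pandas.read_csv`, used in the rest of this section, converts such empty fields to `NaN` by default.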

In [2]:
import pandas as pd

#### 1) Opening a local csv file

In [4]:
df = pd.read_csv('aug_train.csv')

# If the file is not in the same folder as the notebook, pass its relative path instead

In [5]:
df

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19153,7386,city_173,0.878,Male,No relevent experience,no_enrollment,Graduate,Humanities,14,,,1,42,1.0
19154,31398,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1.0
19155,24576,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
19156,5756,city_65,0.802,Male,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0.0
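
If the file lives in another folder, one way to build the path is with `pathlib`. The sketch below is an illustration under assumptions, not the notebook's own code: it writes a tiny made-up CSV into a temporary `data` subfolder (a hypothetical layout) so that it can run anywhere, then reads it back with `pd.read_csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical layout: <project>/data/sample.csv (created here for the demo).
project_dir = Path(tempfile.mkdtemp())
data_dir = project_dir / "data"
data_dir.mkdir()

csv_path = data_dir / "sample.csv"
csv_path.write_text("enrollee_id,city\n8949,city_103\n29725,city_40\n")

# read_csv accepts either a string path or a pathlib.Path.
sample_df = pd.read_csv(csv_path)

print(sample_df.shape)
```

In a real project the equivalent call would be something like `pd.read_csv(Path("data") / "aug_train.csv")`, where the `data` folder name is again only an assumption and the path is resolved relative to the current working directory.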


#### 2) Opening a csv file from a URL

In [7]:
# This code performs the following tasks:
# 1. Downloads a CSV file from a GitHub URL using the `requests` library.
# 2. Adds a custom "User-Agent" header to mimic a web browser request, which helps bypass restrictions on automated downloads.
# 3. Converts the raw CSV text from the response into a file-like object using `StringIO` from Python's `io` module.
# 4. Uses `pandas.read_csv()` to read the CSV content from the `StringIO` object, returning a pandas DataFrame.
# This allows us to load data directly from a URL into a pandas DataFrame without saving it to disk.

import requests
from io import StringIO
import pandas as pd

url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}
req = requests.get(url, headers=headers)  # Sends HTTP GET request with custom headers
req.raise_for_status()  # Stops here with an error if the download failed (4xx/5xx response)
data = StringIO(req.text)  # Converts the response text to a file-like object

pd.read_csv(data)  # Reads the CSV from the StringIO object into a pandas DataFrame


Unnamed: 0,Country,Region
0,Algeria,AFRICA
1,Angola,AFRICA
2,Benin,AFRICA
3,Botswana,AFRICA
4,Burkina,AFRICA
...,...,...
189,Paraguay,SOUTH AMERICA
190,Peru,SOUTH AMERICA
191,Suriname,SOUTH AMERICA
192,Uruguay,SOUTH AMERICA
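
The download steps above can also be wrapped in a small helper that fails loudly on HTTP errors. This is a sketch under assumptions: the function names `csv_text_to_frame` and `read_csv_from_url` and the 10-second timeout are illustrative choices, not part of the original notebook.

```python
from io import StringIO

import pandas as pd
import requests


def csv_text_to_frame(text):
    """Parse raw CSV text into a DataFrame via an in-memory file object."""
    return pd.read_csv(StringIO(text))


def read_csv_from_url(url, headers=None, timeout=10):
    """Download CSV text from `url` and return it as a DataFrame."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return csv_text_to_frame(response.text)


# Offline check of the parsing step, using made-up text instead of a download.
demo = csv_text_to_frame("Country,Region\nAlgeria,AFRICA\nPeru,SOUTH AMERICA\n")
print(demo)
```

With network access, calling `read_csv_from_url(url, headers=headers)` with the `url` and `headers` from the cell above should produce the same table. When no custom headers are needed, `pd.read_csv` can also be given the URL string directly.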


#### Sep parameter