### -----------------------------CHAPTER 6-------------------------------

# pandas.read_csv Arguments: The Wild World of CSV Parsing
- `path`: The VIP pass to your data party—could be a filesystem location (like "C:/data.csv"), a URL (for those fancy online files), or a file-like object. Tell pandas where the treasure is buried!
- `sep` or `delimiter`: The bouncer that splits your row into fields. Give it a character (`,` for commas), or go wild with a regex (like `\s+` for sneaky spaces). Say nothing and it defaults to `,`.
- `header`: Which row gets to wear the crown as column names? Defaults to 0 (first row), but set it to `None` if your file’s too cool for headers. No header, no problem!
- `index_col`: Pick your row label MVPs—column numbers or names to use as the index. Want a fancy hierarchical index? Toss in a list like `["key1", "key2"]` and watch the magic happen.
- `names`: Your chance to play god and name those columns yourself. Pass a list like `["a", "b", "c"]`—because who needs default 0, 1, 2 nonsense?
- `skiprows`: Rows to kick to the curb—either a number (skip the first N rows) or a list of row numbers (starting at 0). Perfect for dodging pesky comments or that awkward intro text.
- `na_values`: The bouncer’s blacklist—values like `"NA"`, `"NULL"`, or whatever you hate that get swapped for `NaN`. Add your own, but they join the default crew unless you say `keep_default_na=False`.
- `keep_default_na`: To roll with pandas’ default `NaN` squad (like `"NA"`, `"NULL"`) or not? `True` by default, but flip it to `False` if you’re a rebel who hates surprises.
- `comment`: The “shush” button—characters (like `#`) that mark the start of comments to ignore. Keeps your data free of chit-chat at the end of lines.
- `parse_dates`: Date whisperer mode, `False` by default (no parsing). Give a list of columns (`[0, 1]` or `["date"]`) to parse just those; per the pandas docs, plain `True` tries to parse the index. Bonus: combining columns takes a nested list like `[["date", "time"]]` (not a tuple), though recent pandas releases deprecate that combining form.
- `keep_date_col`: After parsing dates from multiple columns, keep the originals around? `False` by default—pandas is tidy like that—but set to `True` if you’re sentimental.
- `converters`: Your personal data stylist—a dictionary like `{"foo": lambda x: x.upper()}` to transform column values. Make “foo” shout in caps if you feel like it!
- `dayfirst`: For those ambiguous dates (7/6/2012)—treat as international style (June 7, 2012) with `True`. `False` by default, because pandas assumes you’re an American date snob.
- `date_parser`: Bring your own date-parsing chef—pass a function to cook those dates just right. Default is pandas’ built-in guesswork.
- `nrows`: How much of the file to nibble? Set a number (e.g., 10) to read only the first N rows (excluding header). Perfect for a quick taste test!
- `iterator`: Turn your file into a lazy reader—returns a `TextFileReader` object to process it bit by bit. Pairs nicely with a `with` statement for that classy vibe.
- `chunksize`: For big file feasts, set the chunk size (e.g., 1000 rows) and munch through it piece by piece. It works on its own: passing it makes `read_csv` return a `TextFileReader`, no `iterator=True` required.
- `skipfooter`: Chop off the end: number of lines to ignore at the file’s bottom (older pandas spelled it `skip_footer`). Needs `engine="python"`, since the C engine doesn’t support it.
- `verbose`: Chatty mode—set to `True` to get the gossip on parsing times and memory use. Great for nerds who love the behind-the-scenes drama.
- `encoding`: The secret handshake for text—defaults to `"utf-8"`, but tweak it (e.g., `"latin1"`) if your file’s speaking a different language.
- `squeeze`: If your data’s a one-column wonder, set to `True` to squeeze it into a `Series` instead of a DataFrame. Removed in pandas 2.0; call `.squeeze("columns")` on the result instead.
- `thousands`: The bling separator for big numbers—like `","` in "1,000" or `"."` in "1.000". Default is `None`, so pandas won’t guess your style.
- `decimal`: The decimal diva—usually `"."` (like 3.14), but switch to `","` (like 3,14) if your file’s got European flair. Default is `"."`.
- `engine`: The horsepower under the hood—`"c"` (default, fast), `"python"` (slower but feature-rich), or `"pyarrow"` (new kid, blazing fast). Pick your ride wisely!
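
Several of these arguments tend to show up together. Here is a small sketch, assuming a made-up semicolon-separated export with European number formatting; the data, the comment line, and the `missing` marker are all invented for illustration:

```python
import io
import pandas as pd

# Hypothetical export: a comment line, ';' separators, day-first dates,
# '.' as the thousands separator and ',' as the decimal mark.
raw = io.StringIO(
    "# exported by some tool\n"
    "date;city;amount\n"
    "07/06/2012;Berlin;1.234,50\n"
    "08/06/2012;Paris;missing\n"
)

df = pd.read_csv(
    raw,
    sep=";",                 # field separator
    comment="#",             # ignore everything after '#'
    thousands=".",           # 1.234 means one thousand two hundred thirty-four
    decimal=",",             # ,50 is the fractional part
    na_values=["missing"],   # extra marker treated as NaN
    parse_dates=["date"],    # parse only this column
    dayfirst=True,           # 07/06/2012 is 7 June, not 6 July
)
print(df)
print(df.dtypes)
```

Swapping `dayfirst=True` for the default is a quick way to see how the same text turns into a different date.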

In [1]:
import IPython
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/ex1.csv") #1st row is header
#IF NO HEADER
pd.read_csv("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/ex2.csv", header=None) #default header is 0
pd.read_csv("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/ex2.csv", names=['a', 'b', 'c', 'd', 'message']) #specify header
#INDEXING ROWS
pd.read_csv("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/ex1.csv", index_col='message') #set message as index
pd.read_csv("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/csv_mindex.csv", index_col=["key1" ,"key2"]) #set key1 and key2 as index
#HANDLING VARIABLE WHITESPACE
pd.read_csv("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/ex3.txt", sep=r'\s+') #regex for variable whitespace; the first column becomes the index because the header has one fewer field than the data rows
#SKIP ROWS
pd.read_csv("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/ex4.csv", skiprows=[0, 2, 3]) #skip rows 0, 2, 3
#HANDLING MISSING VALUES
pd.read_csv("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/ex5.csv", na_values=['NULL']) #replace NULL with NaN
pd.read_csv("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/ex5.csv", na_values={'message': ['foo', 'NA'], 'something': ['two']}) #replace foo and NA in message and two in something with NaN
pd.read_csv("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/ex5.csv", keep_default_na=False) #do not replace default NaN values
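
The calls above depend on the book's example files. As a self-contained sketch of the same ideas, assuming made-up in-memory text instead of `ex3.txt` and `ex5.csv`:

```python
import io
import pandas as pd

# Made-up whitespace-aligned text: the header names 3 columns,
# but every data row has 4 fields.
ws_text = (
    "A     B     C\n"
    "aaa  -0.26 -1.02 -0.61\n"
    "bbb   0.92  0.30 -0.03\n"
)
ws = pd.read_csv(io.StringIO(ws_text), sep=r"\s+")
print(ws.index.tolist())   # the extra leading field became the index

# Made-up CSV for the missing-value options.
na_text = "something,a,message\none,1,NA\ntwo,2,foo\nthree,3,world\n"

# Per-column sentinels are added on top of the default NaN markers.
per_column = pd.read_csv(
    io.StringIO(na_text),
    na_values={"message": ["foo"], "something": ["two"]},
)

# With the defaults switched off, the text "NA" stays an ordinary string.
literal = pd.read_csv(io.StringIO(na_text), keep_default_na=False)

print(per_column.isna().sum())
print(literal["message"].tolist())
```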

**READING THE TEXT FILE IN PIECES**
- `chunksize` -> passing it makes `read_csv` return a `TextFileReader` object, an iterable that gives the specified number of rows on each iteration

In [None]:
pd.options.display.max_rows = 10
pd.read_csv("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/ex6.csv")
pd.read_csv("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/ex6.csv", nrows=5) #read only 5 rows
chunker = pd.read_csv("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/ex6.csv", chunksize=1000) #read in chunks
type(chunker)

tot = pd.Series([],dtype='float64')
for piece in chunker:
    tot = tot.add(piece["key"].value_counts(),fill_value=0)
tot.sort_values(ascending=False) #sort values in descending order

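chunker = pd.read_csv("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/ex6.csv", chunksize=1000) #re-create the reader: the loop above already consumed the first one
chunker.get_chunk(5) #get the next 5 rows from chunker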
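
To try the chunked aggregation without `ex6.csv`, here is a minimal sketch on a tiny made-up CSV built in memory (the `key`/`value` columns and the chunk size of 4 are arbitrary choices for the demo):

```python
import io
import pandas as pd

# Build a small made-up CSV: 10 rows with keys cycling A, B, C.
lines = ["key,value"] + [f"{'ABC'[i % 3]},{i}" for i in range(10)]
buf = io.StringIO("\n".join(lines))

# chunksize alone is enough to get an iterable reader back.
reader = pd.read_csv(buf, chunksize=4)

n_chunks = 0
totals = pd.Series(dtype="float64")
for piece in reader:
    n_chunks += 1
    # fill_value=0 treats a key missing from one side as a count of zero
    totals = totals.add(piece["key"].value_counts(), fill_value=0)

totals = totals.sort_values(ascending=False)
print(n_chunks)
print(totals)
```

Each `piece` is an ordinary DataFrame, so any per-chunk computation that can be combined afterwards (counts, sums, min/max) fits this pattern.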

**WRITING DATA TO TEXT FORMAT -> we are using sys.stdout so that it only prints the data on the console and does not make a file**

In [None]:
data = pd.read_csv("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/ex5.csv")
data.to_csv("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/out.csv") #dumps the data of ex5 to a new file out.csv
#CUSTOMIZATIONS THAT WE CAN DO
import sys
data.to_csv(sys.stdout , sep="|") #separate by |
data.to_csv(sys.stdout , na_rep="NULL") #replace NaN with NULL
data.to_csv(sys.stdout , index=False , header=False) #do not write index and header
data.to_csv(sys.stdout , index=False , columns=["a","b","c"]) #write only columns a,b,c
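
The same options can be checked without touching the disk or the console. A small sketch, assuming a made-up three-column frame and an in-memory buffer in place of `sys.stdout`:

```python
import io
import pandas as pd

# Made-up frame with a couple of missing values.
frame = pd.DataFrame({"a": [1, 2], "b": [3.5, None], "message": ["hi", None]})

# Write to an in-memory buffer with a custom separator and NaN marker.
buf = io.StringIO()
frame.to_csv(buf, sep="|", na_rep="NULL", index=False)
piped = buf.getvalue().splitlines()
print(piped)

# With no target at all, to_csv returns the text as a string.
subset = frame.to_csv(index=False, columns=["a", "b"]).splitlines()
print(subset)
```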

**WORKING WITH OTHER DELIMITED FORMATS**
- if you don't wanna use pandas, you can use the csv module
- csv.reader -> it yields each line as a list of strings
- delimiter: The character that separates fields (default ,).
- lineterminator: Line ending used when writing (default \r\n).
- quotechar: The character used for quoting fields (default ").
- quoting: How much to quote (MINIMAL, ALL, etc.)

In [None]:
import csv
f = open("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/ex7.csv")
reader = csv.reader(f)
for line in reader:
    print(line)            #read csv file using csv module; each line comes back as a list of strings
f.close()                  #close this handle before f gets reused below

with open("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/ex7.csv") as f:
    lines = list(csv.reader(f))
header , *values = lines
data_dict = {h:v for h , v in zip(header , zip(*values))}

"""dialect drama"""
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL
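#a reader needs an open file, and the with-block above already closed f, so reopen it
with open("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/ex7.csv") as f:
    reader = csv.reader(f , dialect=my_dialect) #read csv file with custom dialect
    reader = csv.reader(f , delimiter='|') #shortcut -> read csv file with custom delimiter

"""manual writing"""
#writing needs a file opened in write mode; "mydata.csv" is just an example output name
with open("mydata.csv" , "w" , newline="") as f:
    writer = csv.writer(f , dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))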
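
A round trip through an in-memory buffer shows what the dialect settings actually do. This is a sketch with a hypothetical `SemicolonDialect` and made-up rows; `doublequote` and `skipinitialspace` are spelled out here only to make the defaults explicit:

```python
import csv
import io

class SemicolonDialect(csv.Dialect):
    """Hypothetical dialect for this demo: ';' separated, minimal quoting."""
    delimiter = ";"
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL
    lineterminator = "\n"
    doublequote = True
    skipinitialspace = False

# Write a few made-up rows; one field contains the delimiter itself.
out = io.StringIO()
writer = csv.writer(out, dialect=SemicolonDialect)
writer.writerow(("name", "note"))
writer.writerow(("one", "plain"))
writer.writerow(("two", "has;semicolon"))
text = out.getvalue()
print(text)

# Read it back with the same dialect and pivot rows into columns.
rows = list(csv.reader(io.StringIO(text), dialect=SemicolonDialect))
header, *values = rows
columns = {h: list(col) for h, col in zip(header, zip(*values))}
print(columns)
```

With `QUOTE_MINIMAL`, only the field that contains the delimiter gets wrapped in quotes, and the reader strips them again on the way back in.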

**JSON**
- null of json is None in python
- JSON Types: Dictionary (object), list (array), string, number, boolean, null.
- Reading: json.loads (string to Python), pd.read_json (file to DataFrame).
- Writing: json.dumps (Python to string), to_json (DataFrame to JSON).
- Customization: `orient="records"` writes one JSON object per row; to keep only specific columns, select them when building the DataFrame.
- More flexible than csv

In [None]:
#CONVERTING FROM JSON TO PYTHON AND PYTHON TO JSON
import json
obj = """{"name": "Wes",
 "cities_lived": ["Akron", "Nashville", "New York", "San Francisco"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 34, "hobbies": ["guitars", "soccer"]},
              {"name": "Katie", "age": 42, "hobbies": ["diving", "art"]}]
} """

result = json.loads(obj) #convert json to python
asjson = json.dumps(result) #convert python to json

#NOW WE ARE GONNA SEE HOW TO CONVERT JSON TO PANDAS DATAFRAME
data = pd.read_json("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/example.json") #direct method
siblings = pd.DataFrame(result["siblings"],columns=["name","age"]) #manual/specific method

#CONVERTING DATAFRAME TO JSON
data.to_json(sys.stdout)
data.to_json(sys.stdout , orient='records') #convert to json with records orientation
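
Here is the whole JSON round trip in one self-contained sketch, assuming a trimmed-down version of the object above and an in-memory buffer in place of `example.json`:

```python
import io
import json
import pandas as pd

# Trimmed-down JSON text; note the JSON null.
text = (
    '{"name": "Wes", "pet": null, '
    '"siblings": [{"name": "Scott", "age": 34}, {"name": "Katie", "age": 42}]}'
)

result = json.loads(text)            # JSON string -> Python dict
print(result["pet"])                 # JSON null arrives as Python None

# Build a DataFrame from the list of dicts, keeping two fields.
siblings = pd.DataFrame(result["siblings"], columns=["name", "age"])

records = siblings.to_json(orient="records")   # one object per row
split = siblings.to_json(orient="split")       # columns, index, data kept apart
print(records)
print(split)

# read_json accepts a file-like object as well as a path.
round_trip = pd.read_json(io.StringIO(records))
print(round_trip)
```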

**XML AND HTML -> WEB SCRAPING**
- Libraries: lxml (fast), BeautifulSoup and html5lib (flexible); install them for read_html
- read_html: It returns a list of DataFrames, one for each `<table>` tag it finds in the HTML
- Analysis: Once converted , you can do the analysis
- It is used to deal with web pages

In [None]:
tables = pd.read_html("E:/ME/WES MECK - DATA ANALYSIS/GIT CLONE/pydata-book/examples/fdic_failed_bank_list.html")
len(tables) #number of tables
failures = tables[0] #get the first table
failures.head() #get first 5 rows

close_timestamps = pd.to_datetime(failures["Closing Date"])
close_timestamps.dt.year.value_counts() #count the number of failures per year
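
The date step can be tried on its own, without the HTML file or a parser library. A minimal sketch, assuming a made-up stand-in for the scraped table (the bank names and dates are invented, and the explicit `format` is a choice for this demo, not something the original call uses):

```python
import pandas as pd

# Made-up stand-in for the table that read_html would return.
failures = pd.DataFrame({
    "Bank Name": ["Bank A", "Bank B", "Bank C"],
    "Closing Date": ["October 13, 2017", "May 6, 2016", "April 28, 2017"],
})

# Convert the text column to timestamps, then count rows per year.
close_timestamps = pd.to_datetime(failures["Closing Date"], format="%B %d, %Y")
per_year = close_timestamps.dt.year.value_counts()
print(per_year)
```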