## Pandas

- The most popular Python library for data analysis.

- There are two core objects in pandas: the DataFrame and the Series.

### DataFrame

*A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.*

We use the **pd.DataFrame()** constructor to create DataFrame objects.

- The syntax for declaring a new one is a dictionary whose **keys** are the **column names** (Bob and Sue in the example below), and whose **values** are **lists** of entries. This is the standard way of constructing a new DataFrame, and the one you are most likely to encounter.

- The **list** of row labels used in a DataFrame is known as an **Index**. We can assign values to it by using an index parameter in our constructor:

In [1]:
import pandas as pd

In [2]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])


|           | Bob           | Sue          |
|-----------|---------------|--------------|
| Product A | I liked it.   | Pretty good. |
| Product B | It was awful. | Bland.       |


### Series
*A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list:*

In [3]:
pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

- A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an **index** parameter.
- However, a Series does not have a column name; it only has one overall **name**:

In [4]:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64
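
To make the "single column of a DataFrame" idea concrete, here is a small sketch. The table and its values are invented for illustration. Selecting one column gives back a Series that carries the column label as its name and shares the DataFrame's index.

```python
import pandas as pd

# Hypothetical sales table, invented for this example.
sales = pd.DataFrame(
    {"Product A": [30, 35, 40], "Product B": [21, 34, 38]},
    index=["2015 Sales", "2016 Sales", "2017 Sales"],
)

# Selecting a single column returns a Series.
product_a = sales["Product A"]

print(type(product_a).__name__)  # Series
print(product_a.name)            # Product A
print(list(product_a.index))     # ['2015 Sales', '2016 Sales', '2017 Sales']
```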

> A CSV file is a table of values separated by commas. Hence the name: "Comma-Separated Values", or CSV.

## Reading the Data Files

- wine_reviews = pd.read_csv("inputfilepath.csv", index_col=0) ---*("index_col=0" makes pandas use the first column already present in the CSV file as the index instead of creating a new one from scratch)*


- wine_reviews.shape   ---*(the shape attribute returns a (rows, columns) tuple, so we can check how large the resulting DataFrame is)*


- wine_reviews.head()  ---*("head" returns the first five rows of the DataFrame unless a different number is passed)*
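
The three calls above can be tried without a real file. This is only a sketch: the CSV text is a made-up stand-in for "inputfilepath.csv", fed to pandas through `io.StringIO`.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the first (unnamed) column holds the row labels.
csv_text = """,country,points,price
0,Italy,87,
1,Portugal,87,15.0
2,US,87,14.0
"""

# Without index_col, pandas keeps the label column as data and names it "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# index_col=0 tells pandas to use that first column as the index instead.
wine_reviews = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(wine_reviews.shape)    # (rows, columns)
print(wine_reviews.head(2))  # first two rows; head() alone defaults to five
```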




### Saving the resulting DataFrame to a CSV file 

*If we have a DataFrame named "animals" (for example, the result of our model), we can save it to a CSV file as shown below. The file name inside the parentheses is arbitrary.*

- animals.to_csv("cows_and_goats.csv")
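
A quick way to check that saving worked is to write the file and read it back. The sketch below assumes a made-up "animals" table and writes into a temporary directory so nothing is left behind.

```python
import os
import tempfile
import pandas as pd

# Hypothetical contents for the "animals" DataFrame.
animals = pd.DataFrame({"Cows": [12, 20], "Goats": [22, 19]},
                       index=["Year 1", "Year 2"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cows_and_goats.csv")
    animals.to_csv(path)                        # the index is written as the first column
    restored = pd.read_csv(path, index_col=0)   # so read it back with index_col=0

print(restored)
```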

### NOTE

- Referring to the **Kaggle Notebooks for MICROCOURSES revision** is better, along with the insights obtained from the exercises, which are listed below.

- For Intro to ML and Python, refer to the MANUAL NOTES diary.

## Insights from the Kaggle MicroCourse Exercises on PANDAS

In [11]:
my_df=pd.DataFrame()


> If our DataFrame has a column named "description", then:
- df_series = my_df["description"]
- df_df     = my_df[["description"]]

- df_series is a SERIES (single brackets with one label)
- df_df is a DATAFRAME (double brackets, i.e. a list of labels)
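
A small sketch to confirm the difference, using an invented two-row table:

```python
import pandas as pd

# Hypothetical data; only the "description" column name comes from the notes.
my_df = pd.DataFrame({"description": ["Crisp.", "Fruity."], "points": [87, 90]})

df_series = my_df["description"]    # single label -> Series
df_df = my_df[["description"]]      # list of labels -> DataFrame with one column

print(type(df_series).__name__, df_series.shape)  # Series (2,)
print(type(df_df).__name__, df_df.shape)          # DataFrame (2, 1)
```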


- **Don't forget that iloc and loc take square brackets [], not parentheses ()**
- loc selects by label, iloc selects by integer position
- Examples:
  - my_df.loc[1, "description"]
  - my_df.iloc[0, :3]
  - df = reviews.loc[[0, 1, 10, 100], ["country", "province", "region_1", "region_2"]]
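
The sketch below shows loc and iloc side by side on an invented table. The index is deliberately set to 10, 20, 30 so that labels and positions differ.

```python
import pandas as pd

# Hypothetical data with a non-default index.
reviews = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US"],
     "province": ["Sicily & Sardinia", "Douro", "Oregon"],
     "points": [87, 87, 87]},
    index=[10, 20, 30],
)

by_label = reviews.loc[20, "country"]      # row labelled 20 -> "Portugal"
by_position = reviews.iloc[0, :2]          # first row, first two columns
subset = reviews.loc[[10, 30], ["country", "province"]]

# Slices differ too: loc includes the end label, iloc excludes the end position.
label_slice = reviews.loc[10:20]           # rows labelled 10 and 20
position_slice = reviews.iloc[0:1]         # only the first row
```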

- **Use "&" and "|" only; "and" and "or" won't work on pandas DataFrames and Series.** Wrap each condition in its own parentheses.
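
A minimal sketch of combining conditions, on invented data. It also shows what happens when the plain Python keyword is used instead.

```python
import pandas as pd

# Hypothetical data for illustration.
reviews = pd.DataFrame({"country": ["Italy", "Italy", "US", "France"],
                        "points": [95, 85, 92, 88]})

# Each condition gets its own parentheses because & and | bind tighter than == and >=.
italian_and_good = reviews[(reviews.country == "Italy") & (reviews.points >= 90)]
italian_or_good = reviews[(reviews.country == "Italy") | (reviews.points >= 90)]

# "and" asks the whole Series for a single True/False, which pandas refuses to give.
and_failed = False
try:
    reviews[(reviews.country == "Italy") and (reviews.points >= 90)]
except ValueError:
    and_failed = True

print(len(italian_and_good), len(italian_or_good), and_failed)
```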

> my_df.price.**idxmax()** returns the index label of the row where the maximum value of price occurs in the my_df DataFrame (this equals the row number only when the index is the default 0, 1, 2, ...)
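
A short sketch with string labels (made up for this example) shows that idxmax() hands back a label, which can then be passed to loc:

```python
import pandas as pd

# Hypothetical prices with non-numeric row labels.
my_df = pd.DataFrame({"price": [15.0, 65.0, 14.0]},
                     index=["wine_a", "wine_b", "wine_c"])

top_label = my_df.price.idxmax()   # label of the most expensive row
top_row = my_df.loc[top_label]     # fetch that row by label

print(top_label, top_row.price)
```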



> a=[True,False]
- sum(a) returns 1
- In Python arithmetic, True counts as 1 and False counts as 0, so summing a list of booleans counts the True values
- The same holds for a boolean Series.
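
As a quick check of the point above (the Series values are arbitrary):

```python
import pandas as pd

a = [True, False]
print(sum(a))          # 1

flags = pd.Series([True, False, True, True])
print(flags.sum())     # 3, the number of True values
print(flags.mean())    # 0.75, the fraction of True values
```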

> Don't forget to use map and apply for complex operations. They help a lot!
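
A rough sketch of the two, on invented data: `Series.map` transforms one value at a time, while `DataFrame.apply` with `axis="columns"` calls the function once per row.

```python
import pandas as pd

# Hypothetical data for illustration.
reviews = pd.DataFrame({"country": ["Italy", "US"], "points": [86, 90]})

# map: subtract the mean from every score.
points_mean = reviews.points.mean()
centered = reviews.points.map(lambda p: p - points_mean)

# apply with axis="columns": build one label per row from several columns.
labels = reviews.apply(lambda row: f"{row.country}: {row.points}", axis="columns")

print(centered.tolist())
print(labels.tolist())
```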

**reviews.groupby('winery').apply(lambda df: df.title.iloc[0])**

*Here we first group the DataFrame "reviews" by the "winery" column and then select the title of the first row in each group.*

- groupby is Really POWERFUL

> _mean_r = reviews.groupby("taster_name").points.mean()

- In "_mean_r", the index holds the reviewers' names and the values are the mean score given by each reviewer
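
Both groupby patterns can be sketched on a tiny invented table. One assumption to note: the first-title example selects the "title" column before calling apply, a slight variation on the line in the notes that avoids passing the grouping column into the lambda.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the notes.
reviews = pd.DataFrame({
    "winery": ["Nicosia", "Nicosia", "Rainstorm"],
    "title": ["Nicosia 2013 Bianco", "Nicosia 2012 Rosso", "Rainstorm 2013 Pinot Gris"],
    "taster_name": ["Kerin O'Keefe", "Kerin O'Keefe", "Paul Gregutt"],
    "points": [87, 91, 87],
})

# First title within each winery group.
first_titles = reviews.groupby("winery")["title"].apply(lambda s: s.iloc[0])

# Mean score per reviewer: index = reviewer name, value = mean points.
mean_r = reviews.groupby("taster_name").points.mean()

print(first_titles)
print(mean_r)
```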



- NOTE that dtype is an attribute, not a method, so no parentheses are required after it
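
A short sketch of the attribute/method difference, with invented data: `dtype` and `dtypes` are read without parentheses, while `astype` is a method and needs them.

```python
import pandas as pd

# Hypothetical data for illustration.
reviews = pd.DataFrame({"points": [87, 90], "price": [15.0, None]})

print(reviews.points.dtype)   # attribute of a Series, no parentheses
print(reviews.dtypes)         # one dtype per column of the DataFrame

# astype is a method, so it is called with parentheses.
as_float = reviews.points.astype("float64")
print(as_float.dtype)
```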

### Sometimes the price column is null. How many reviews in the dataset are missing a price?
- missing_price_reviews = reviews[reviews.price.isnull()]
- n_missing_prices = len(missing_price_reviews)


- #Cute alternative solution: if we sum a boolean series, True is treated as 1 and False as 0
- n_missing_prices = reviews.price.isnull().sum()


- #or equivalently:
- n_missing_prices = pd.isnull(reviews.price).sum()
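
The three approaches above should agree. Here is a sketch on a made-up price column with two missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical prices; two of the four are missing.
reviews = pd.DataFrame({"price": [15.0, np.nan, 14.0, np.nan]})

# 1) Filter the rows with a missing price, then count them.
missing_price_reviews = reviews[reviews.price.isnull()]
n_by_len = len(missing_price_reviews)

# 2) Sum the boolean mask directly (True counts as 1).
n_by_sum = reviews.price.isnull().sum()

# 3) Same thing through the top-level function.
n_by_pd = pd.isnull(reviews.price).sum()

print(n_by_len, n_by_sum, n_by_pd)
```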

**Don't forget to handle overlapping column names (for example with the lsuffix and rsuffix parameters) before using the join() function**
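
One reading of this reminder is that join() refuses to combine two DataFrames that share a column name unless suffixes are supplied. The sketch below assumes that reading; the tables and the suffix strings are invented.

```python
import pandas as pd

# Hypothetical tables that both have a "views" column.
canada = pd.DataFrame({"views": [100, 200]}, index=["video_a", "video_b"])
britain = pd.DataFrame({"views": [150, 250]}, index=["video_a", "video_c"])

# Without suffixes the overlapping column name makes join() raise.
overlap_error = False
try:
    canada.join(britain)
except ValueError:
    overlap_error = True

# With suffixes, each side keeps its own renamed column (left join on the index).
joined = canada.join(britain, lsuffix="_CAN", rsuffix="_UK")
print(joined)
```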