# Introduction to Pandas

Pandas is a Python library for high-level data manipulation.

It is built on the NumPy package, and it stores tabular data in an object known as a DataFrame.

Unlike a NumPy array, which is n-dimensional but tolerates only one data type, a DataFrame can hold a different data type in each column.

- Rows are the observations in a table.

- Columns are the variables of these observations.

There are several ways to create a Pandas DataFrame:
- One way is to build a dictionary by hand and convert it to a DataFrame

```
import pandas as pd

# Keys become column labels; each list holds that column's values
data = {
  "countries": ["Brazil", "China", "India", "Japan", "Pakistan"],
  "capital": ["Brazilia", "Beijing", "Delhi", "Tokyo", "Islamabad"],
  "population": [22.3, 200.4, 190, 25.4, 30.4]}

table = pd.DataFrame(data)

# If the dict values are scalars rather than lists, wrap the dict in a list:
# table = pd.DataFrame([data])

table
```
- The other way is to read a CSV file

```
# Import the CSV file and use the first column as the index
# index_col=0 tells read_csv that the first column holds the row labels
table = pd.read_csv("path/to/table.csv", index_col=0)
table

```
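Since the path above is only a placeholder, here is a minimal sketch that can run without a file on disk. The CSV text and the name csv_table are made up for illustration, and io.StringIO stands in for a real file.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk
csv_text = """code,countries,capital
BR,Brazil,Brazilia
JP,Japan,Tokyo
"""

# index_col=0 makes the first column ("code") the row index
csv_table = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(csv_table)
```

Without index_col, pandas would keep "code" as an ordinary column and number the rows from 0.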

In [None]:
import pandas as pd

In [None]:
# Named data rather than dict, so the built-in dict type is not shadowed
data = {
  'countries' : ["Brazil", "China", "India", "Japan", "Pakistan"],
  'capital' : ["Brazilia", "Beijing", "Delhi", "Tokyo", "Islamabad"],
  'population' : [22.3, 200.4, 190, 25.4, 30.4]
  }

# Creating a Pandas DataFrame object
table = pd.DataFrame(data)

# Setting the index
table.index = ["BR", "CN", "IN", "JP", "PK"]
table

Unnamed: 0,countries,capital,population
BR,Brazil,Brazilia,22.3
CN,China,Beijing,200.4
IN,India,Delhi,190.0
JP,Japan,Tokyo,25.4
PK,Pakistan,Islamabad,30.4


# Data Access Using Pandas
We'll explore how to access our table data using pandas and square brackets.

You can access column data with:


```
table[["column_name"]] # Returns a DataFrame
table["column_name"] # Returns a Series
```
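To make the difference between the two forms concrete, here is a small sketch. The demo table is a cut-down stand-in for the table built earlier, and the variable names are only illustrative.

```python
import pandas as pd

# Small stand-in table mirroring the one built earlier
demo = pd.DataFrame(
    {"countries": ["Brazil", "Japan"], "population": [22.3, 25.4]},
    index=["BR", "JP"],
)

as_frame = demo[["population"]]   # double brackets return a DataFrame
as_series = demo["population"]    # single brackets return a Series

print(type(as_frame).__name__)
print(type(as_series).__name__)
```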

You can select several columns, or a slice of rows, with:
```
# Select two columns (returns a DataFrame)
table[["column_name","another_column_name"]]

# Access the 2nd, 3rd & 4th rows using a slice
table[1:4]
```

Square-bracket access is limited: it cannot select specific rows and columns in one step. For that, pandas provides loc & iloc, which both follow the same pattern:

- table.loc[ rows , columns ]

- loc is an indexer that takes row labels and column labels. ex.
```
table.loc[["CN", "IN", "JP"], ["capital","countries"]]
```


- iloc is an indexer that takes integer row positions and column positions. ex.
```
table.iloc[[1, 2, 3], [0,1]]
```
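As a quick check of how the two indexers relate, the sketch below selects the same cells once by label and once by position. The demo table is a shortened stand-in for the table built earlier.

```python
import pandas as pd

# Shortened stand-in for the table built earlier
demo = pd.DataFrame(
    {
        "countries": ["Brazil", "China", "India"],
        "capital": ["Brazilia", "Beijing", "Delhi"],
        "population": [22.3, 200.4, 190.0],
    },
    index=["BR", "CN", "IN"],
)

# Same selection, once by label and once by integer position
by_label = demo.loc[["CN", "IN"], ["countries", "capital"]]
by_position = demo.iloc[[1, 2], [0, 1]]

print(by_label.equals(by_position))
```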

In [None]:
table

# Accessing the 3rd, 4th & 5th rows using loc
table.loc[["IN","JP","PK"], ["countries","capital"]]

# Accessing every row but displaying only the population using iloc
table.iloc[:, [2]]

Unnamed: 0,population
BR,22.3
CN,200.4
IN,190.0
JP,25.4
PK,30.4


# Advanced Filtering with NumPy & Pandas
Filtering with NumPy and pandas is divided into 3 major steps.

- Select the column you need
- Do a comparison on that column
- Use the result to select countries in the table

We'll go over an example below to see how it works

In [None]:
table
# Selecting the population column using loc & keeping it as a Series
table.loc[:, 'population']

# Do a comparison on the population, e.g. population smaller than 35 million
table.loc[: , 'population'] < 35

# Select countries using the above calculation
cod = table.loc[: , 'population'] < 35
table[cod]

Unnamed: 0,countries,capital,population
BR,Brazil,Brazilia,22.3
JP,Japan,Tokyo,25.4
PK,Pakistan,Islamabad,30.4


# Shortcut To Filtering
We've broken the filtering process down into 3 steps; however, all three can also be combined into one line.

In [None]:
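import numpy as np
table[table.loc[:, 'population'] < 35]  # the three steps above in one line
# np.logical_and takes two conditions, e.g. population between 20 and 30 million
table[np.logical_and(table.loc[:, 'population'] > 20, table.loc[:, 'population'] < 30)]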

# Looping Over DataFrames
We'll explore looping over DataFrame and Series objects.

First, since a DataFrame is similar to a 2-D array, let's try out a basic for loop


```
for val in table:
  print(val)

# Output:
countries
capital
population
```
As the output shows, a basic for loop over a DataFrame only prints the column labels.


It is better to use the **iterrows** method, which yields two values on each iteration: the row label and the row data as a Series.


In [None]:
for lab, row in table.iterrows():
  print(lab)
  print(row)

We can also print selectively by specifying which column of each row we need.

In [None]:
for lab, row in table.iterrows():
  print(f"{lab}: { row['capital'] }")

BR: Brazilia
CN: Beijing
IN: Delhi
JP: Tokyo
PK: Islamabad


# Adding New Columns
Now that we've covered looping over table data, we move on to adding columns.

There are several ways to add a column to an existing table, and some are slower than others.

We'll go through a slower one first, then a faster one.

In [None]:
for lab, row in table.iterrows():
  # Assign one value per iteration: the length of each country's name
  table.loc[lab, "name_length"] = len(row['countries'])

print(table)


   countries    capital  population  name_length
BR    Brazil   Brazilia        22.3          6.0
CN     China    Beijing       200.4          5.0
IN     India      Delhi       190.0          5.0
JP     Japan      Tokyo        25.4          5.0
PK  Pakistan  Islamabad        30.4          8.0


A better, faster and more efficient way of doing the same thing is to use the apply method

In [None]:
table['name_length'] = table["countries"].apply(len)
table

Unnamed: 0,countries,capital,population,name_length
BR,Brazil,Brazilia,22.3,6
CN,China,Beijing,200.4,5
IN,India,Delhi,190.0,5
JP,Japan,Tokyo,25.4,5
PK,Pakistan,Islamabad,30.4,8
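
As a side note that goes slightly beyond the original notes, pandas also offers a vectorized string accessor. The sketch below uses a made-up stand-in table and compares apply(len) with the .str.len() alternative; both give the same lengths.

```python
import pandas as pd

# Made-up stand-in for the countries column used above
demo = pd.DataFrame(
    {"countries": ["Brazil", "China", "Pakistan"]},
    index=["BR", "CN", "PK"],
)

via_apply = demo["countries"].apply(len)   # calls len on each value
via_str = demo["countries"].str.len()      # vectorized string accessor

print(via_apply.tolist())
print(via_str.tolist())
```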


# Setting Up a CSV File
After importing the CSV file as a pandas DataFrame, we can inspect the first few rows and the column summary using
```
dataset.head()
dataset.info()
```

# Create a Dictionary From Lists With Column Names

In [None]:
# Create the years and durations lists
years = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020]
durations = [103, 101, 99, 100, 100, 95, 95, 96, 93, 90]

# Create a dictionary with the two lists
movie_dict = { "years" : years,
              "durations": durations
}
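
The natural next step is to turn that dictionary into a DataFrame. A minimal sketch, where durations_df is a name chosen here for illustration:

```python
import pandas as pd

# Same lists as in the cell above
years = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020]
durations = [103, 101, 99, 100, 100, 95, 95, 96, 93, 90]

# Each key becomes a column label, each list a column
movie_dict = {"years": years, "durations": durations}

# durations_df is a hypothetical name for the resulting DataFrame
durations_df = pd.DataFrame(movie_dict)
print(durations_df.head())
```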

# Methods and Attributes of a DataFrame Object
Anything accessed without parentheses is an attribute of the object; anything called with parentheses is a method.

```
dataset.head() # Gives the first few rows of the data for inspection
dataset.shape # Gives the dimensions of the obj
dataset.info() # Gives the names of column & data types, along with missing data
dataset.describe() # Calculates the mean, count, min, std etc
dataset.values # Gives us a 2-D array of the dataset
dataset.columns # Gives column names
dataset.index # Gives row names
```
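
To try these on real values, here is a small sketch. The dataset below is a made-up three-row table, not the CSV from earlier.

```python
import pandas as pd

# Made-up three-row dataset for illustration
dataset = pd.DataFrame({"years": [2011, 2012, 2013],
                        "durations": [103, 101, 99]})

# Attributes: no parentheses
print(dataset.shape)
print(list(dataset.columns))

# Methods: called with parentheses
summary = dataset.describe()
print(summary.loc["mean", "durations"])
```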



# Sorting The DataFrame


```
# Sorting by a single column (ascending can be True or False)
dataset.sort_values("<column_name>", ascending=True)


# Sorting by several columns, with a sort direction for each
dataset.sort_values(["<column_name>", "<column_name>"], ascending=[True, False])

```
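
A runnable sketch of both forms, using a made-up dataset with hypothetical genre and duration columns:

```python
import pandas as pd

# Made-up dataset; the column names are hypothetical
dataset = pd.DataFrame({
    "genre": ["drama", "comedy", "drama", "comedy"],
    "duration": [95, 103, 99, 90],
})

# Single column, largest first
by_duration = dataset.sort_values("duration", ascending=False)

# genre A-Z, then duration largest first within each genre
by_both = dataset.sort_values(["genre", "duration"], ascending=[True, False])

print(by_duration)
print(by_both)
```

sort_values returns a new DataFrame and leaves the original unchanged.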

# Subsetting a DataFrame With Multiple Conditions

```
guest_present = dataframe["has_guest"] == True
viewership_high = dataframe["viewership_mil"] > 20
dataframe[guest_present & viewership_high]

# One-liner
dataframe[(dataframe["has_guest"] == True) & (dataframe["viewership_mil"] > 20)]
```
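
The same idea as a self-contained sketch. The data is invented, and the column names follow the example above.

```python
import pandas as pd

# Invented data; column names follow the example above
dataframe = pd.DataFrame({
    "has_guest": [True, False, True, True],
    "viewership_mil": [25.0, 30.0, 12.5, 21.0],
})

guest_present = dataframe["has_guest"]          # already boolean
viewership_high = dataframe["viewership_mil"] > 20

# & combines the two boolean Series element by element
subset = dataframe[guest_present & viewership_high]
print(subset)
```

In the one-line form, each comparison needs its own parentheses because & binds more tightly than comparison operators such as > and ==.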

We can also use the isin method, which returns True for each value found in the given list:

```
dataframe["color"].isin(["black","blond"])
```

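# Filter the DataFrame by Row


```
# Works when the column already holds boolean values
dataframe[dataframe["<column_name>"]]

# or keep the rows whose value appears in a list of allowed values

dataframe[dataframe["<column_name>"].isin(<values_array>)]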
```
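
A short sketch of row filtering with isin, on invented data (the name and color columns are hypothetical):

```python
import pandas as pd

# Invented data for illustration
dataframe = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee"],
    "color": ["black", "red", "blond", "brown"],
})

# Boolean Series: True where the color is in the list
mask = dataframe["color"].isin(["black", "blond"])

# Keep only the rows where the mask is True
filtered = dataframe[mask]
print(filtered)
```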



# Summary Stats
Mean
- It is one way of telling you where the center of your data is


```
dataframe["<col_name>"].mean()
```


---
# Aggregate
agg
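- It applies a summary function to column data, including custom ones. In the example, pct30 is assumed to be a user-defined function that returns the 30th percentile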


```
dataframe["<col_name>"].agg(pct30)

```
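
The snippet above never defines pct30, so the sketch below assumes one possible definition based on the quantile method. The score column and its values are invented.

```python
import pandas as pd

def pct30(column):
    # Assumed definition: the 30th percentile of the column
    return column.quantile(0.3)

# Invented data for illustration
dataframe = pd.DataFrame({"score": [10, 20, 30, 40, 50]})

result = dataframe["score"].agg(pct30)
print(result)
```

agg also accepts a list of functions, which is handy when several summary statistics are needed at once.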


---
# Cumulative Statistics
- cummax()
- cummin()
- cumprod()
- cumsum()



```
dataframe["col_num"].cumsum() # Adds each value to the running total and returns it at each index
```
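
A small sketch of all four cumulative methods on an invented column:

```python
import pandas as pd

# Invented values for illustration
dataframe = pd.DataFrame({"col_num": [3, 1, 4, 2]})

running_sum = dataframe["col_num"].cumsum()    # running total
running_max = dataframe["col_num"].cummax()    # largest value so far
running_min = dataframe["col_num"].cummin()    # smallest value so far
running_prod = dataframe["col_num"].cumprod()  # running product

print(running_sum.tolist())
print(running_max.tolist())
```

Each method returns a Series of the same length as the input, rather than a single number.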


---
# Min / Max


```
dataframe["col_num"].min()
```


---






# Counting
- Dropping duplicate names


```
dataframe.drop_duplicates(subset="<col_name>")
# or, to treat rows as duplicates only when several columns match
dataframe.drop_duplicates(subset=["<col_name>", "<col_name>"])
```
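
A runnable sketch of both calls, on invented data with hypothetical name and breed columns:

```python
import pandas as pd

# Invented data for illustration
dataframe = pd.DataFrame({
    "name": ["Max", "Max", "Bella", "Max"],
    "breed": ["Lab", "Poodle", "Lab", "Lab"],
})

# Keep the first row for each distinct name
unique_names = dataframe.drop_duplicates(subset="name")

# Keep the first row for each distinct (name, breed) pair
unique_pairs = dataframe.drop_duplicates(subset=["name", "breed"])

print(unique_names)
print(unique_pairs)
```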
---


- Counting the values

The value_counts method accepts extra parameters such as sort and normalize.

```
dataframe["<col_name>"].value_counts()
```
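
A brief sketch of the sort and normalize parameters, using an invented breed column:

```python
import pandas as pd

# Invented data for illustration
dataframe = pd.DataFrame({"breed": ["Lab", "Poodle", "Lab", "Lab"]})

# Counts per value, most frequent first
counts = dataframe["breed"].value_counts(sort=True)

# Proportions instead of raw counts
shares = dataframe["breed"].value_counts(normalize=True)

print(counts)
print(shares)
```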