**<span  style="color:green; font-size:40px">01. Intro to Pandas</span>**

# 01. What is Pandas?

* Examples of groupby, pivot, tidying, timeseries-resample

**<span  style="color:green; font-size:40px">02. Selecting Subsets of Data</span>**

# 1. Pandas Intro
* What is Pandas
* Why numpy is faster
* %timeit
* Index | Columns | Data(values)
* Axis numbering: rows are `axis=0` (or `'index'`) | columns are `axis=1` (or `'columns'`)
* type(x) -->  pandas.core.frame.DataFrame ---> Package | Subpackage | Modules | Class
* Common Data Types (Boolean, float, integer, Object, DateTime)
* Missing values (NaN | None(object) | NaT)
* Converting object columns into DateTime by passing `parse_dates` to `pd.read_csv`
* Different data types use different amounts of memory
* **Metadata** - size, shape, info(), info(memory_usage='deep'), dtypes

# 2. Setting a Meaningful Index
* Extract Index: df.index | Extract Columns: df.columns | Extract Data: df.values
* Same extractions can be done for series
* Use `index_col` to set the index and read df simultaneously `movie = pd.read_csv('../data/movie.csv', index_col='title')`
* An Index can be subset with a slice, e.g. `idx2[100:120:4]`, or with a list of integer positions, e.g. `idx2[list]`
* We can set the index later (after reading the dataframe) by using `set_index`
* To change display settings such as column width, use `pd.options.display.<option>`. To restore the defaults, use `pd.reset_option('all')`
* To change the number of displayed columns and rows: `pd.set_option('display.max_columns', 40, 'display.max_rows', 8)`
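
A minimal sketch of the two ways to set the index described above. The CSV text is made-up stand-in data (the notes use `../data/movie.csv`, which is not reproduced here), read from an in-memory buffer so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical stand-in for movie.csv
csv_text = "title,year,score\nA,1999,8.1\nB,2005,6.0\nC,2012,7.4\n"

# Set the index while reading...
movie = pd.read_csv(io.StringIO(csv_text), index_col="title")

# ...or read first and set the index afterwards
same = pd.read_csv(io.StringIO(csv_text)).set_index("title")

idx = movie.index          # the Index object
cols = movie.columns       # the column labels
every_other = idx[0:3:2]   # slicing an Index by position with a step
```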

# 3. Making the most of a Jupyter Notebook

# 4. Selecting Subsets of Data from DataFrames with just the brackets
* Select subsets of data by using `[ ]` | `.loc` | `.iloc`

# 5. Selecting Subsets of Data from DataFrames with `.loc`
Three ways to select:
* To select a single row and a single column: `df.loc[row_name, column_name]`
* To select multiple rows and multiple columns:
    * Step 1. Make a list of the rows to select, named `rows`, and a list of the columns, named `columns`
    * Step 2. `df.loc[rows, columns]`
* To select a run of consecutive rows, use slicing. **Slicing does NOT need extra square brackets**, i.e. `df.loc[row1:row5, column_name]`. Here the rows are a slice and the column is a single label


* We can also provide step-size while using `.loc`


* **Weird part of `.loc`**: unlike ordinary Python slicing, a `.loc` label slice always includes the last value
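
The `.loc` selection styles above can be sketched as follows. The frame, its labels and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame indexed by label
df = pd.DataFrame(
    {"year": [1994, 1972, 2008, 1999, 2010],
     "score": [9.3, 9.2, 9.0, 8.7, 8.8]},
    index=["Shawshank", "Godfather", "Dark Knight", "Matrix", "Inception"],
)

one_value = df.loc["Matrix", "score"]                          # one row, one column -> scalar

rows = ["Shawshank", "Inception"]
columns = ["year", "score"]
block = df.loc[rows, columns]                                  # lists of labels -> DataFrame

sliced = df.loc["Godfather":"Matrix", "score"]                 # label slice: end label is included
stepped = df.loc["Shawshank":"Inception":2, "year"]            # label slice with a step of 2
```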

# 6. Selecting Subsets of Data from DataFrames with `.iloc`
Same 3 way selection as `.loc`. Instead of row/column names we will use integer location here.
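
A short sketch of the same selections with `.iloc`, using made-up data. Note that, unlike `.loc`, an `.iloc` slice excludes its end position.

```python
import pandas as pd

# Hypothetical toy frame
df = pd.DataFrame({"a": [10, 20, 30, 40], "b": [1.5, 2.5, 3.5, 4.5]},
                  index=["w", "x", "y", "z"])

cell = df.iloc[2, 0]               # row position 2, column position 0
rows_cols = df.iloc[[0, 3], [1]]   # lists of positions -> DataFrame
head = df.iloc[0:2, 0]             # positions 0 and 1; position 2 is excluded
```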

# 7. Selecting Subsets of Data - Series
* Same `.loc` | `.iloc` selection of series similar to dataframe 
* Only `[ ]` selection is also possible like dataframe

# 8. Boolean Indexing Single Conditions
* For filtering, `.iloc` does NOT accept a boolean Series
* Use `[ ]` or `.loc` for filtering

# 9. Boolean Indexing Multiple Conditions
* Combine conditions with the `&` (and), `|` (or), and `~` (not) operators, wrapping each condition in parentheses
* Use `isin` to test one column against several values, e.g. `movies['content_rating'].isin(['R', 'PG-13', 'PG'])`
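
A minimal sketch of combining conditions. The `movies` frame here is a made-up stand-in with hypothetical column names, not the course dataset.

```python
import pandas as pd

# Hypothetical stand-in for the movies data
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "content_rating": ["R", "G", "PG-13", "PG"],
    "imdb_score": [8.1, 6.0, 7.4, 5.2],
    "year": [1999, 2005, 2012, 2016],
})

good = movies["imdb_score"] > 7
recent = movies["year"] >= 2010

both = movies[good & recent]           # and
either = movies.loc[good | recent]     # or
not_good = movies[~good]               # not
rated = movies[movies["content_rating"].isin(["R", "PG-13", "PG"])]
```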

# 10. Boolean Indexing More
* Convert a column containing dates into the datetime data type by using `parse_dates`
* To find whether column values lie between 2 numbers, use **`between`**, i.e. `filt = series.between(n1, n2)` gives a boolean Series. Then `df.loc[filt]` or `series.loc[filt]`
* Use `.isin` to filter a column by multiple values
* `.isna` and `.isnull` are the same
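
A small sketch of `between` and `isna` on invented values.

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 12, np.nan, 20, 8])

filt = s.between(8, 12)     # both ends are included by default; NaN gives False
in_range = s.loc[filt]

missing = s.isna()          # same result as s.isnull()
```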

**<span  style="color:green; font-size:40px">02. Essential Commands</span>**

**<span  style="color:red; font-size:30px">Series</span>**

# 1. Series Attributes and Statistical Methods
* `series.size`
* `series.index`
* `series.values`


**Aggregation methods:** When calling aggregation methods, pandas ignores missing values by default
* `sum` example. `series.sum()` 
* `min`
* `max`
* `mean`
* `median`
* `std` - standard deviation
* `var` - variance
* `count` - returns the number of non-NA values | `len(series_name)` gives the total number (including null values)
* `describe` - returns most of the above aggregations in one Series
* `quantile` - returns given percentile of distribution
* Multiple aggregation in 1 time: `series.agg(['min', 'max', 'median'])`


**NON-Aggregation methods:** This will return a new series
* `abs` - takes absolute value
* `round` - round to the nearest given decimal
* `cummin` - cumulative minimum
* `cummax` - cumulative maximum
* `cumsum` - cumulative sum
* `rank` - rank values in a variety of different ways
* `diff` - difference between one element and another within a series
* `pct_change` - percent change from one element to another in a series 
* Pass a negative period, `series.diff(-1)` or `series.pct_change(-1)`, to reverse the direction (compare each element with the next one)

==================================================================================================

* Finding a **percentile** of a series: `series.quantile(q=.999)` outputs the 99.9th percentile of the series
* For Series aggregations, `skipna` is True by default. Pass `skipna=False` to make missing values propagate into the result.
* To order a series from biggest to lowest (or vice versa) and assign each value a rank, use `rank`
* `diff` - difference between one element and another in a series; pass a negative value to reverse the direction
* Subtract series elements at a given step size by using `periods` with `diff` or `pct_change`
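
A sketch contrasting aggregation methods (one value out) with non-aggregation methods (a new series out), on made-up numbers.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 12.0, np.nan, 9.0, 15.0])

# Aggregations return a single value and skip NaN by default
total = s.sum()
total_strict = s.sum(skipna=False)     # NaN, because the missing value propagates
n_valid, n_all = s.count(), len(s)
summary = s.agg(["min", "max", "median"])

# Non-aggregations return a series of the same length
running = s.cumsum()
change = s.diff()                      # each element minus the previous one
back = s.diff(-1)                      # each element minus the next one
ranks = s.rank(ascending=False)        # 1 = largest value
```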

# 2. Series Methods More
**How to handle missing values**
* **`isna`** - Returns a Series of booleans based on whether each value is missing or not
* **`notna`** - Exact opposite of **`isna`**
* **`fillna`** - fills missing values in a variety of ways
* **`dropna`** - Drops the missing values from the Series
* **`.count()`** method gives the number of **non-missing** values
* **`.isna().mean()`** gives the fraction of missing values (multiply by 100 for a percentage)
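
The missing-value methods above, sketched on an invented series.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

mask = s.isna()                  # True where the value is missing
frac_missing = s.isna().mean()   # fraction (not percent) of missing values
filled = s.fillna(0)             # replace missing values with 0
dropped = s.dropna()             # remove missing values
n_present = s.count()            # number of non-missing values
```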

==================================================================================================
* `.sort_values()` to sort series values
* `.sort_index()` will sort series index
* `.sample(n)` selects `n` random elements. Use `frac` to select a fraction of the data instead
* `series.sample(10, replace=True)`: `replace=True` allows a value to be drawn more than once. This makes it possible to draw a sample larger than the original series
* `.idxmax()` index of maximum series value
* `.idxmin()` index of minimum series value
* `.unique()` gives unique values of a series
* `.nunique()` gives the number of unique values. By default, `nunique` does not count missing values, i.e. NaN.
* `drop_duplicates(keep='first')` keeps the first occurrence of each value and removes the later repeats; `drop_duplicates(keep=False)` removes every value that occurs more than once
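
A sketch of the methods in this list on made-up values. The `random_state` argument is only there to make the sampling repeatable.

```python
import pandas as pd

s = pd.Series([3, 1, 3, 2, 1], index=list("abcde"))

by_value = s.sort_values()
where_max = s.idxmax()                   # label of the first maximum
where_min = s.idxmin()                   # label of the first minimum
distinct = s.unique()                    # array of unique values
n_distinct = s.nunique()

first_kept = s.drop_duplicates(keep="first")   # later repeats removed
never_repeated = s.drop_duplicates(keep=False) # every repeated value removed

bigger = s.sample(10, replace=True, random_state=0)  # sample larger than the series
part = s.sample(frac=0.4, random_state=0)            # 40% of the elements
```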

# 3. String Series Methods
* `value_counts()` is most useful for object or string columns, although it works for all data types
* `value_counts(normalize=True)` to get proportions instead of counts
* Methods specific to string columns are accessed through `series.str.method()`
 
 **`.str` specific methods** [str accessor API][1] 
* `series.str.count('x')` counts the occurrences of `x` (or any other passed pattern) in each element
* `series.str.contains('x')` gives a boolean series showing whether `x` (or any other passed pattern) is present
* `series.str.find('x')` gives the lowest index at which `x` (or any other passed string) is found, or -1 if it is not found
* `series.str.len()` gives the length of each string element in the series



* Methods common to pandas `.str` and Python strings:
* `series.str.split()` splits each string on whitespace, or on whatever separator is passed to `split()`
* **`series.str.split(expand=True)`** returns the split pieces as the columns of a dataframe
* Replace `x` with `y` by using `series.str.replace('x', 'y')`
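
A sketch of the `.str` methods above on invented names. `regex=False` is passed to `replace` here as an assumption-free way to force literal replacement, since the default for that argument has differed between pandas versions.

```python
import pandas as pd

names = pd.Series(["Ada Lovelace", "Alan Turing", "Grace Hopper"])

freq = names.value_counts(normalize=True)      # proportions instead of counts

n_a = names.str.count("a")                     # lowercase 'a' per element
has_tur = names.str.contains("Tur")
first_o = names.str.find("o")                  # -1 when 'o' is absent
lengths = names.str.len()

parts = names.str.split(expand=True)           # one column per piece
swapped = names.str.replace("a", "4", regex=False)
```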

[1]: http://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-str

# 4. Datetime Series Methods
* An entire column/series of datetime values has the `datetime` data type. Each element of that series is a `Timestamp`. The difference between 2 `datetime` series or 2 `Timestamp`s is a `Timedelta`

* Use `parse_dates` to convert columns that hold dates into datetime columns

<br>

<br>

* Use `series.dt.attributes` (example of attributes: `year`, `month`, `minute`, `dayofweek` etc) [Visit the API for datetime attributes][1]
* Web-page embedding in jupyter notebook using `IFrame`




* Datetime (`series.dt.methods`) methods: `ceil` | `floor` | `round` | `strftime` | `to_period`. Use them with [offset aliases][2], e.g. `series.dt.ceil('H')`
* Offset aliases example - `D` - day | `H` - hour  |  `T` or `min` - minute  | `S` - second. pandas 2.2 and later deprecate the uppercase `H`, `T` and `S` in favour of `h`, `min` and `s`
<br>

<br>

* **Timedelta** is an amount of time. It can be produced by subtracting 2 datetime series. A timedelta series also has the `.dt` accessor
* **Period** represents a span of time, e.g. an entire month, year or minute. Use `series.dt.to_period('M')` (or any other offset alias) to convert a datetime into a period. A period series also has the `.dt` accessor
* **[strftime][3]** stands for **str**ing **f**ormat **time**. It converts a datetime to a formatted string, e.g. `series.dt.strftime('%A, %B %d, %Y at %X')`
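
A sketch of the `.dt` accessor on two made-up timestamps. The lowercase `'h'` alias is used on the assumption that it is accepted across pandas versions, whereas the uppercase `'H'` is deprecated in newer releases.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2021-03-05 14:37:10", "2021-12-31 23:59:59"]))

years = ts.dt.year
weekday = ts.dt.dayofweek            # Monday = 0 ... Sunday = 6
hour_up = ts.dt.ceil("h")            # round up to the next full hour
months = ts.dt.to_period("M")        # datetime -> monthly period
text = ts.dt.strftime("%d/%m/%Y")    # datetime -> formatted string

gaps = ts - ts.iloc[0]               # subtracting datetimes gives timedeltas
gap_days = gaps.dt.days              # timedelta series also have .dt
```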

[1]: http://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-dt

[2]: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

[3]: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

**<span  style="color:red; font-size:30px">DataFrame</span>**

# 5. DataFrame Attributes and Methods
* `index` | `columns` | `values` | `dtypes` | `shape`(rows, columns) | `size` (row x column)
* The default index is a `RangeIndex`
* To select the columns of a specific data type, use `df.select_dtypes('number')` (or `'object'`, `'datetime'` etc.), or provide a list such as `['int', 'object']`
* `pd.options.display` changes how dataframes are displayed
* Common methods and attributes of series and df
* Methods unique to series and to dataframes

# 6. DataFrame Descriptive Statistic Methods

**Aggregation methods:** When calling aggregation methods, pandas ignores missing values by default
* Some of the following methods can be applied to the entire df even when it has object columns. Just pass `numeric_only=True` as an argument to the method
* `sum`
* `min`
* `max`
* `mean`
* `median`
* `std` - standard deviation
* `var` - variance
* `count` - returns number of non-na values
* `describe` - returns most of the above aggregations in one DataFrame
* `quantile` - returns given percentile of distribution


**NON-Aggregation methods:** These return a new DataFrame
* `abs` - takes absolute value
* `round` - round to the nearest given decimal. Applicable to entire df. Will ignore object columns automatically
* `cummin` - cumulative minimum
* `cummax` - cumulative maximum
* `cumsum` - cumulative sum
* `rank` - rank values in a variety of different ways
* `diff` - difference between one element and another
* `pct_change` - percent change from one element to another in each column (pass a negative period to reverse it)


`axis`= VERTICALLY `axis = 0(or 'index')` | HORIZONTALLY `axis = 1(or 'columns')` <br>
`df.describe(include='object')` summary statistics on string columns <br>
`df.describe()` summary statistics on numerical columns

# 7. DataFrame Methods More
* `sort_values()` | `sort_index()`
* **Dealing with missing values:** `isna`, `notna`, `fillna`, `dropna`
* Delete NaN rows `df.dropna()`, delete columns with NaN `df.dropna(axis='columns')`, delete rows based on NaN values of a particular column `df.dropna(subset=['content_rating'])`
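* `fillna` can forward fill (`ffill`) or backward fill (`bfill`), with or without a limit: `df.fillna(method='ffill')` | `df.fillna(method='ffill', limit=1)`. pandas 2.1 and later deprecate the `method` argument in favour of `df.ffill()` / `df.bfill()`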
* Use dictionary to fillna multiple columns at once: `df.fillna({'col1': 'PG', 'col2': 199})`
* Dropping all rows of a dataframe based on NaN values of `x` column: `df.dropna(subset=['x'])`
* Find the index corresponding to the maximum and minimum values with **`idxmax`** and **`idxmin`**
* idxmax for all number columns `df.select_dtypes('number').idxmax()`
* Rename column by **`rename`** `college.rename(columns={'col_name': 'new_col_name','col2': 'new_col_name2'})`
* Drop column by **`df.drop(columns=[])`** and rows by **`drop(index=[])`**
<br>
<br>
* Drop duplicate rows: **`drop_duplicates`**. `df.shape` and `df.drop_duplicates().shape` to check how many exact same rows are in the dataset
* Drop duplicate rows based on 1 column: use `subset` with **`drop_duplicates`**
* **`pd.to_numeric`** to coerce string columns to numeric data types
* Sort dataframe by two columns `df.sort_values(by=['a', 'b'])`
* Sort dataframe by two columns in different directions: `df.sort_values(by=['a', 'b'], ascending=[True, False])`
* Finding index of maximum value of each numeric columns `df.select_dtypes('number').idxmax()`
* `series.str.isalpha()` checks whether each string element is made up only of alphabetic characters. For elements that are not strings it returns NaN instead of False. To get a boolean series, use `fillna(False)`
* Insert a series/column to a particular position in the dataframe by **`df.insert(position, new_column_name, values)`**
* Series methods not found in DataFrames - `str`, `dt`, `unique`, `value_counts` (pandas 1.1 added a `DataFrame.value_counts`, which counts unique rows)
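
A sketch of several of the DataFrame methods above. The frame and its column names are invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame with a missing rating, a duplicate row and a dirty numeric column
df = pd.DataFrame({
    "title": ["A", "B", "C", "C"],
    "content_rating": ["PG", None, "R", "R"],
    "budget": ["100", "250", "n/a", "n/a"],
})

deduped = df.drop_duplicates()                          # exact duplicate rows removed
rated = df.dropna(subset=["content_rating"])            # drop rows missing a rating
filled = df.fillna({"content_rating": "Unrated"})       # per-column fill values
renamed = df.rename(columns={"budget": "budget_usd"})
slim = df.drop(columns=["budget"])
budget_num = pd.to_numeric(df["budget"], errors="coerce")  # unparseable strings -> NaN

with_year = df.copy()
with_year.insert(1, "year", [2001, 2002, 2003, 2003])   # insert modifies in place
```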

# 9. DataFrame Methods More II
* Drop duplicate rows: **`drop_duplicates`**. `df.shape` and `df.drop_duplicates().shape` to check how many exact same rows are in the dataset
* Use `subset` with **`drop_duplicates`** to delete rows in which the values of particular columns are repeated
* **`clip`**: works on both df and series. It puts a lower and/or upper bound on numeric columns, e.g. `df.clip(50000, 100000)` OR `df.clip(upper=50000)`
* Insert a series/column at a particular position in the dataframe with **`df.insert(location, name_of_new_column, series_values)`**. To find the location of the column after which the new column should go, use `df.columns.get_loc('column name')`; **location** is that value + 1
* **`replace`** works with both numeric and string columns in a series or dataframe: `df.replace(to_replace=old_value, value=new_value)`
* For multiple **replace** in 1 step use a dictionary `df.replace(to_replace={'White':'WHITE', 50000: 11111, 70000: 99999})`
* To replace a **substring** (part of the text), we have to use **regex=True** `df.replace(to_replace={'Department':'dept'}, regex=True)`
* `df.nlargest(5, 'salary')`: as an alternative to `sort_values`, use `nlargest` or `nsmallest`
* `df.sort_values('salary').drop_duplicates(subset=['race', 'sex'])` keeps the lowest-salary row for each race/sex combination
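
A sketch of `clip`, `replace`, `nlargest` and the sort-then-deduplicate idiom. The `emp` frame and its values are made up; the replacements are done per column here to keep the example simple.

```python
import pandas as pd

# Hypothetical employee data
emp = pd.DataFrame({
    "race": ["White", "Black", "White", "Asian"],
    "sex": ["F", "M", "F", "M"],
    "dept": ["Police Department", "Fire Department", "Parks", "Police Department"],
    "salary": [45000, 120000, 70000, 98000],
})

clipped = emp["salary"].clip(50000, 100000)                        # bound values on both sides
upper_case = emp["race"].replace({"White": "WHITE"})               # whole-value replacement
short_dept = emp["dept"].replace({"Department": "Dept"}, regex=True)  # substring replacement

top2 = emp.nlargest(2, "salary")
lowest_per_group = emp.sort_values("salary").drop_duplicates(subset=["race", "sex"])
```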

# 10. Changing Data types
* Pass `'float'` | `'int'` | `'str'` as an argument to the Series/df `astype()` method, e.g. `df['col_name'].astype('str')` OR `df.astype('str')`
* When converting an int to bool, 0 becomes False and every other number becomes True


* If a numeric column has dtype 'object', use `sort_values(ascending=False)` to check whether non-numeric strings are really present in the column (they move above the numbers because digit characters have lower Unicode code points than alphabetic characters). Convert such columns with **`pd.to_numeric(objectseries/column_name, errors='coerce')`**
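
A sketch of `astype` and of tracking down a stray string in a numeric-looking column, on invented values.

```python
import pandas as pd

s = pd.Series([0, 1, 5])
as_bool = s.astype("bool")     # 0 -> False, everything else -> True
as_float = s.astype("float")
as_str = s.astype("str")

# Numbers stored as strings, with one bad entry
raw = pd.Series(["10", "7", "abc", "3"])
suspect = raw.sort_values(ascending=False).iloc[0]   # letters sort above digits
clean = pd.to_numeric(raw, errors="coerce")          # bad entries become NaN
```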

# Case study: Calculating normality
* Checking normality using Z-score
* For a boolean series, `series.all()` returns True if every value in the series is True, otherwise False. `series.any()` returns True if at least one value in the series is True

**<span  style="color:green; font-size:40px">03. Grouping</span>**

# 1. Groupby Aggregation Basics
* `df.groupby('<grouping column>').agg({'<aggregating column>':'<aggregating function>'})`
* After grouping, use `.reset_index()` if required to turn the group labels back into ordinary columns
* `.ngroups` produces number of groups


* **Aggregation functions used with groupby**
    + **`sum`**
    + **`min`**
    + **`max`**
    + **`mean`**
    + **`median`**
    + **`std`**
    + **`var`**
    + **`count`** - count of non-missing values
    + **`size`** - count of all elements
    + **`first`** - first value in group
    + **`last`** - last value in group
    + **`idxmax`** - index of maximum value in group
    + **`idxmin`** - index of minimum value in group
    + **`any`** - checks for at least one True value - returns boolean
    + **`all`** - checks that every value is True - returns boolean
    + **`unique`** Gives a list of unique values
    + **`nunique`** - number of unique values in group
    + **`sem`** - standard error of the mean
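
A minimal sketch of the basic groupby pattern, with made-up department and salary data. It also shows the difference between `count` (non-missing values) and `size` (all elements).

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing salary
df = pd.DataFrame({
    "dept": ["Police", "Fire", "Police", "Fire", "Parks"],
    "salary": [50000.0, 60000.0, 70000.0, np.nan, 40000.0],
})

grouped = df.groupby("dept")
mean_salary = grouped.agg({"salary": "mean"})   # group labels become the index
flat = mean_salary.reset_index()                # group labels back as a column
n_groups = grouped.ngroups

counts = grouped["salary"].count()              # non-missing values per group
sizes = grouped.size()                          # all rows per group
```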

# 2. Grouping and Aggregating with Multiple Columns
* Grouping with multiple columns
* To find `size` or number of total elements (Aggregation): `df.groupby(['race', 'gender']).size()`: This will give a series
* `df.groupby(['race', 'sex']).size().reset_index(name='size')` will give a dataframe
<br>
<br>


* `df.groupby(['race', 'sex']).agg({'col1': ['min', 'max'], 'col2': ['idxmax', 'idxmin']})`
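
A sketch of grouping by two columns, using an invented `emp` frame.

```python
import pandas as pd

# Hypothetical employee data
emp = pd.DataFrame({
    "race": ["White", "White", "Black", "Black", "White"],
    "sex": ["F", "M", "F", "F", "F"],
    "salary": [50, 60, 70, 80, 90],
})

sizes = emp.groupby(["race", "sex"]).size()                             # Series with a MultiIndex
sizes_df = emp.groupby(["race", "sex"]).size().reset_index(name="size") # DataFrame

# Several aggregations on one column give two-level column labels
stats = emp.groupby(["race", "sex"]).agg({"salary": ["min", "max", "idxmax"]})
white_f_max = stats.loc[("White", "F"), ("salary", "max")]
white_f_row = stats.loc[("White", "F"), ("salary", "idxmax")]
```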

# 3. Grouping with Pivot Tables [and Styling]
* `pivot_table` is often a more readable alternative to a groupby on 2 columns, because it spreads one grouping column across the columns of the result
* `df.pivot_table(index='col1', columns='col2', values='col3', aggfunc='mean')`
    * `index` - grouping column
    * `columns` - grouping column
    * `values` - aggregating column
    * `aggfunc` - aggregating function (defaulted to the mean)
* For finding size, `values` is not needed `df.pivot_table(index='col1', columns='col2', aggfunc='size')`
* Multiple columns `df.pivot_table(index=['col1','col2'],columns=['col3','col4'],values='salary',aggfunc='median')`
    
    
* Use `astype('int')` | `round(x)` to reduce noise in resulting pivot table
* `aggfunc` similar to Aggregation functions used with groupby
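
A sketch of `pivot_table` for an aggregation and for group sizes. The data is made up.

```python
import pandas as pd

# Hypothetical employee data
emp = pd.DataFrame({
    "race": ["White", "White", "Black", "Black", "White"],
    "sex": ["F", "M", "F", "F", "F"],
    "salary": [50, 60, 70, 80, 90],
})

# One grouping column down the rows, the other across the columns
table = emp.pivot_table(index="race", columns="sex", values="salary", aggfunc="mean")

# Group sizes need no `values`; fill_value replaces the empty combinations
counts = emp.pivot_table(index="race", columns="sex", aggfunc="size", fill_value=0)

tidy_table = table.round(0)   # rounding reduces noise in the displayed result
```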


* **Styling**: To highlight data
    * `style.highlight_max()`
    * `.style.highlight_max(axis=None)` highlights the highest value of the entire dataframe | `axis=0` - highest value in each column | `axis=1` - highest value in each row
    * To highlight both the maximum and minimum value of each row: `.style.highlight_max(axis='columns').highlight_min(axis='columns', color='lightblue')`
    * `style.background_gradient(cmap='YlOrRd')`
    * To add thousands separators to numbers, use `.style.format('{:,.0f}')`
    
    
* Styling documentation: https://pandas.pydata.org/pandas-docs/stable/style.html

# 4. Counting with Crosstabs
* `pd.crosstab`
* Frequency counting with a Series USE `value_counts()`


* **Groupby**
    * `df.groupby(['col1', 'col2']).size()`
    
* **Pivot Table**: Multiple columns can be used
    * `df.pivot_table(index='col1', columns='col2', aggfunc='size', fill_value=0)`
    * `df.pivot_table(index=['col1','col2'],columns=['col3','col4'],values='salary',aggfunc='median')`
    * for size we do not have to put `values` parameter
<br>
<br>
* **CrossTab**: counts by default and can normalize the counts (it also accepts `values` and `aggfunc` for other aggregations)
    * `pd.crosstab(index=df['col1'], columns=df['col2'])`
    * Use `normalize='all'` | `'columns'` | `'index'`
    * Use `margins=True`
    
    
* Multi-index crosstabs
    * Lists of Series: `index = [df['col1'], df['col2']]` | `cols = [df['col3'], df['col4']]`
    * Then crosstab: `pd.crosstab(index=index, columns=cols)`
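
A sketch of `pd.crosstab` with counts, normalization, margins and a multi-level index, on invented data.

```python
import pandas as pd

# Hypothetical employee data
emp = pd.DataFrame({
    "race": ["White", "White", "Black", "Black", "White"],
    "sex": ["F", "M", "F", "F", "F"],
    "dept": ["Police", "Fire", "Police", "Parks", "Fire"],
})

ct = pd.crosstab(index=emp["race"], columns=emp["sex"])          # raw counts
share = pd.crosstab(emp["race"], emp["sex"], normalize="index")  # each row sums to 1
with_totals = pd.crosstab(emp["race"], emp["sex"], margins=True) # adds 'All' row/column

# Multi-level index: pass a list of Series
multi = pd.crosstab(index=[emp["race"], emp["sex"]], columns=emp["dept"])
```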

# 5. Reshape MORE

* If we have a multi-level index due to groupby, we can use `unstack` to make 1 of the index into columns 
* `df.groupby(['cola', 'colb']).size().unstack(level='colb', fill_value=0)`
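
A sketch of `unstack` on a two-level groupby result, using made-up data. The result has the same shape as a crosstab of the two columns.

```python
import pandas as pd

# Hypothetical data
emp = pd.DataFrame({
    "race": ["White", "White", "Black", "Black", "White"],
    "sex": ["F", "M", "F", "F", "F"],
})

long = emp.groupby(["race", "sex"]).size()          # MultiIndex Series
wide = long.unstack(level="sex", fill_value=0)      # 'sex' level moved to the columns
```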



* Use **`chi2_contingency`** to check for a statistically significant association between 2 groups after a crosstab
    * **Step 1**. Do a crosstab between the 2 groups: `ct = pd.crosstab(index=df['col1'], columns=df['col2'])`
    * **Step 2**. Check whether the difference between the 2 groups is statistically significant:
        * `from scipy.stats import chi2_contingency`
        * `chi2_contingency(ct)`: It gives the chi2 test statistic | p-value | degrees of freedom | expected counts
        * A p-value that is essentially 0 gives strong evidence that the counts of the two groups really do differ.


* Use `fill_value=0` every time you see a `NaN` after groupby (`unstack`) | pivot_table. `pd.crosstab` has no `fill_value` parameter; use `.fillna(0)` there

# 6. Create Your Own Data Analysis
#### Places to find data
* [Kaggle datasets][1]
* [data.world][2]
* Most large US cities, [NYC][3], [Denver][4], [US Gov't][5], simply do a web search for "open data [US city]"



[1]: https://www.kaggle.com/datasets
[2]: https://data.world/
[3]: https://opendata.cityofnewyork.us/
[4]: https://www.denvergov.org/opendata
[5]: https://www.data.gov/

# 7. Alternate GroupBy Syntax
* Other syntaxes of groupby - DO NOT use / no need to remember
* `size` | `count` for groupby - we do NOT need to put a column under `.agg`, e.g. `df.groupby(['col1', 'col2']).size()` or use `count()`. `size` generates a single column. `count` generates multiple columns, giving the number of non-null values in each column

#  8. Custom Aggregation
* Writing a customized aggregation function for a series. It must output a single aggregated value. It can also be used with groupby
    * `def min_max(s):`<br>
      `return s.max() - s.min()`
    * ex- `df.groupby('col1').agg({'col2': min_max})` The custom function is passed as an object, not as a string, and it must be an aggregation function
    * Because the aggregating column is `distance`, pandas passes the `distance` series of each group to the function, so the custom function looks like<br> 
    `def min_max(sub_series):`<br>
    `return sub_series.max()-sub_series.min()`
    
    * We can use a lambda function instead: `flights.groupby('airline').agg({'distance': lambda sub_series: sub_series.max()- sub_series.min()})`
    
    
* Find first and last rows of each group `flights.groupby('airline').nth([0,-1])` - several other functions available in pandas https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html
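
A sketch of a custom aggregation function. The `flights` frame is a made-up stand-in, and `value_range` is a hypothetical name for the max-minus-min function described above.

```python
import pandas as pd

# Hypothetical stand-in for the flights data
flights = pd.DataFrame({
    "airline": ["AA", "AA", "AA", "UA", "UA"],
    "distance": [100, 400, 250, 900, 300],
})

def value_range(sub_series):
    """Receive one group's series and return a single value."""
    return sub_series.max() - sub_series.min()

by_func = flights.groupby("airline").agg({"distance": value_range})
by_lambda = flights.groupby("airline").agg(
    {"distance": lambda sub_series: sub_series.max() - sub_series.min()}
)

# First and last row of each group
first_last = flights.groupby("airline").nth([0, -1])
```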

# 9. Transform and Filter with GroupBy
* **Filtering** - End result: the original dataframe with certain groups filtered out
    * Write a custom function that returns a single boolean value. Each group is passed in as `sub_df`, for example
        * `def find_tot(sub_df):`
        * `return sub_df['col2'].sum() > 15` then `df.groupby('Col').filter(find_tot)`
        * `df.groupby('Col').filter(CUSTOM_FUNCTION)`
        <BR>
    
        <BR>
    * OR `df.groupby('Col').filter(lambda sub_df: sub_df['col2'].sum() > 15)` notice no `.agg` used
    * Find actors that appeared in at least 25 movies --> group by actor, then keep the groups whose size is at least 25
<BR>
                        
<BR>                      
* **Transform** - does NOT aggregate; it TRANSFORMS, returning one value per original row. Helpful when we want to divide every value of a particular column in a group by the maximum value of that column in that group
    * `df.groupby('item')['quantity'].transform(custom_function)` Because we are transforming a series, e.g. `quantity`, the custom function receives a sub_series (where `filter` receives a sub_df)
    * `def min_max(sub_series):`<br>
      `return sub_series-sub_series.min()`
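
A sketch contrasting `filter` (keeps or drops whole groups) with `transform` (returns one value per original row). The data and the threshold of 15 are illustrative.

```python
import pandas as pd

# Hypothetical order data
df = pd.DataFrame({
    "item": ["pen", "pen", "ink", "ink", "cap"],
    "quantity": [10, 8, 3, 5, 2],
})

# filter: the function gets each group's sub-DataFrame and returns one boolean
kept = df.groupby("item").filter(lambda sub_df: sub_df["quantity"].sum() > 15)

# transform: the result lines up with the original rows
group_max = df.groupby("item")["quantity"].transform("max")
share_of_max = df["quantity"] / group_max

# transform with a custom function that receives each group's series
above_min = df.groupby("item")["quantity"].transform(lambda s: s - s.min())
```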

# Extra-Case Study: Counting Pandas [Series & Dataframe attributes/methods]
* How to extract tables from a .html
* How to extract tables from a html which contain a specific word using RegEx `match`


**<span  style="color:green; font-size:40px">05. Time-Series</span>**

# 1. Datetime(TimeStamp) and Timedelta
* **date** - Just the Month, Day and Year.
* **time** - Just the Hours, minutes, seconds and parts of a second (milli/micro/nano). 


* **DateTime | TimeStamp**: Python's `datetime` corresponds to `Timestamp`, which is specific to pandas
* Use the `.dt` accessor with a datetime series, NOT with an individual datetime object
* `pd.to_datetime` converts strings and numbers to Timestamps, e.g. `pd.to_datetime('July 20, 1969 2:56 a.m. 15 seconds')`
* datetime - Has both date (Year, Month, Day) and time (Hour, Minute, Second). Time goes down to the nanosecond
* Make a datetime from the epoch: `pd.to_datetime(20000, unit='d')`; unit can be `'s'`, `'h'`, `'d'`. For the current time use `pd.to_datetime('today')`. On a single Timestamp we can use attributes directly, with no need for the `.dt` accessor that a datetime series requires
    * example `day = pd.to_datetime('Jan 15, 1997')`
    * `day.day_name()`
* **To convert a column into DateTime** while reading the file, use: `pd.read_csv('x.csv', parse_dates=['hire_date'])`
<br>

<br>

* **TimeDelta**: An amount of time (it has no date part; it ranges from days down to nanoseconds). Subtracting 2 datetimes or timestamps generates a TimeDelta
* `pd.to_timedelta('5days 10h')` converts strings and numbers to Timedeltas; with a number, pass a unit such as `'s'`, `'h'` or `'d'`, e.g. `pd.to_timedelta(1, unit='d')`. Recent pandas versions reject **`'y'`** because a year is not a fixed length of time
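
A sketch of single `Timestamp` and `Timedelta` objects. The string formats used here are ones assumed to parse on any pandas version; the uppercase `'D'` unit is used for days.

```python
import pandas as pd

day = pd.to_datetime("Jan 15, 1997")       # a single Timestamp
name = day.day_name()                      # attributes and methods work directly, no .dt

from_epoch = pd.to_datetime(20000, unit="D")   # 20000 days after 1970-01-01
since_epoch = from_epoch - pd.Timestamp("1970-01-01")

span = pd.to_timedelta("5 days 10:00:00")  # string -> Timedelta
hours = pd.to_timedelta(36, unit="h")      # number + unit -> Timedelta

gap = pd.Timestamp("2020-03-01") - pd.Timestamp("2020-02-01")   # Timestamp difference
```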

# 2. Intro to Time Series-DateTime index
* JSON data from the IEX trading API `pd.read_json()`
* Use `.dt` accessor
* Convert datetime column to `datetime index` to get the advantages of an index
    * Big advantage is **partial selection**, imagine index is in format `2017-04-06`. If we use `df.loc['2017-04-06']` then that row will be selected. If we use `df.loc['2017-04']` entire month would be selected. `df.loc['2017-01':'2017-02']` will select Jan and Feb of 2017. `df.loc['2017']` will select entire year. 
    
    
* **DateTime sampling** is possible ONLY if the datetime is in the index 
    * `weather = pd.read_csv('../data/weather.csv', parse_dates=['date'], index_col='date')`
    * To **sample** from the entire `datetime index` (e.g. the last day of each month), use **`asfreq`**. This only works with a **`DatetimeIndex`**
    * We need to pass an **offset alias** to `asfreq` http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases e.g. `df.asfreq('BA')` selects the final business day of each year
    * Based on need, we might have to pass an **anchored offset alias**, e.g. `df.asfreq('W-FRI')` samples every Friday. We can customize further: `df.asfreq('6W-FRI')` samples every 6th Friday


* **Upsampling and Downsampling**
    * Upsampling (suppose the data is daily and we sample every 4 hours) can create a lot of NaN values; use `ffill` to fill them, e.g. `df.asfreq('4H', method='ffill')`


* **Custom offset alias**: Custom business days (As some weekdays are holidays)


* **Creating date ranges**, e.g. `pd.date_range(start='1/1/2012', periods=10, freq='20S')` creates 10 values beginning with January 1, 2012, spaced 20 seconds apart. Of `start`, `end`, `periods` and `freq`, exactly three must be given, so we can replace `freq` with `end`, i.e. `pd.date_range(start='1/1/2012', end='10/1/2012', periods=8)` gives 8 evenly spaced values
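
A sketch of partial selection, `asfreq` and `date_range` on a made-up daily series. Lowercase `'12h'` and `'20s'` are used on the assumption that they are accepted across pandas versions.

```python
import pandas as pd

# Hypothetical daily data for Jan-Mar 2017
idx = pd.date_range(start="2017-01-01", end="2017-03-31", freq="D")
df = pd.DataFrame({"temp": range(len(idx))}, index=idx)

feb = df.loc["2017-02"]                      # partial selection: the whole month
jan_feb = df.loc["2017-01":"2017-02"]        # partial slice: both months included

fridays = df.asfreq("W-FRI")                 # downsample: every Friday
half_days = df.asfreq("12h", method="ffill") # upsample: new rows filled forward

rng = pd.date_range(start="1/1/2012", periods=10, freq="20s")
even = pd.date_range(start="1/1/2012", end="10/1/2012", periods=8)
```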

# 3. Grouping by Time
* Group by time with **`resample`**. `resample` must be passed an **offset alias** [Link](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases)
    * example- `df.resample('M').agg({'column': 'mean'})`   
<p> </p>      
* For `resample` to work, we need a DatetimeIndex, or we have to pass the `on` parameter naming the datetime column, e.g. `df.resample('M', on='date_col').agg({'column': 'mean'})`   
<p> </p>   
* If we are grouping by a month or quarter, labelling each group (in the index) with a single date would NOT make sense. We can label it as a **period**: `df.resample('Q', kind='period').agg({'col': ['size', 'min']})`
<p> </p>   
* We can convert a datetime column to a period column with `df['datetimecolumn'].dt.to_period('M')`. The `.dt` accessor also works on period columns
<p> </p>   
* **Anchored offset:** When we resample, each group is labelled by a specific day (e.g. month end, or Sunday for weekly aggregation). We can anchor to another day, e.g. `df.resample('W-WED')` | or every 5 months (`df.resample('5M')`) | every 22 weeks anchored to Thursday (`.resample('22W-THU')`)
<p> </p>
* To resample on a datetime index and label the result by period:<br>
`df.resample('W', kind='period').agg({'column':'sum'})`
<p> </p>
* `.dt` accessor works for period columns
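
A sketch of `resample` on made-up daily data. Weekly aliases are used because they behave the same across pandas versions. Grouping by `dt.to_period('M')` is shown as one way to get period labels; it is an alternative I am assuming is acceptable here, not the `kind='period'` call from the notes.

```python
import pandas as pd

# Hypothetical data: 60 days starting on Friday 2021-01-01, one sale per day
idx = pd.date_range("2021-01-01", periods=60, freq="D")
df = pd.DataFrame({"date_col": idx, "sales": 1.0})

# Datetime in a column: name it with `on`
weekly = df.resample("W", on="date_col").agg({"sales": "sum"})   # weeks end on Sunday

# Datetime in the index, weeks anchored to Wednesday
wed = df.set_index("date_col").resample("W-WED").agg({"sales": "sum"})

# Monthly groups labelled by period instead of by a single date
month = df["date_col"].dt.to_period("M")
by_period = df.groupby(month)["sales"].agg(["size", "sum"])
```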

* Displaying any website in the jupyter notebook <br>
`from IPython.display import IFrame`<br>
`IFrame('url', width=800, height=500)`<br>

# 4. Rolling Windows
* Rolling windows are similar to resample, but not identical.

* **Rolling windows rules**
    * Use **datetime index** | or use `on` parameter
    * For rolling windows, the dates should be sorted
    * Try using days (`'D'`) in the offset alias, i.e. `'365D'` for a 1-year (12-month) rolling window
    

* `df.rolling('5D').agg({'column': ['mean', np.size]})` WITH offset alias


* OR `df.rolling(5).agg({'column': ['mean', np.size]})` WITHOUT offset alias. In this case, to prevent `NaN`, use `min_periods=1`. We can also use `center=True`, i.e. `df.rolling(5, min_periods=3, center=True).agg({'column': ['mean', np.size]})`


* **Series resample | rolling window**: the same syntax works on a series; replace `rolling` with `resample` to resample, and an offset alias can be used with either<br>
    * `series.rolling(5).mean()` | `series.rolling(5).agg(['mean','min'])`
    * Only rolling requires **`np.size`**. Resample can use `'size'`
    

* While reading a csv file, we can use `parse_dates` to make a column a datetime column and `index_col` on the same column to make it the index, all in 1 line of code
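
A sketch of count-based and time-based rolling windows on a made-up daily series.

```python
import pandas as pd

idx = pd.date_range("2021-01-01", periods=6, freq="D")
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

by_count = s.rolling(3).mean()                               # first 2 values are NaN
no_nan = s.rolling(3, min_periods=1).mean()                  # partial windows allowed
centered = s.rolling(3, min_periods=1, center=True).mean()   # window centred on each row

by_time = s.rolling("2D").sum()          # offset alias: needs a sorted datetime index
multi = s.rolling(3).agg(["mean", "min"])
```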

# 5. Grouping by Time & another Column requires GROUPER
* An alternative to `resample` is `groupby` + `Grouper`. 
 * **1st step:** `tg = pd.Grouper(freq='5Y')`
 * **2nd step:** `df.groupby(tg).agg({'salary':'mean'})`
 * If the datetime is just a column and not the index, then `tg = pd.Grouper(freq='5Y', key='date_col')`
 
 
* For multi-level grouping, list the non-datetime column first and then the datetime column.


* **Grouping together** (a `Grouper` inside a single `groupby`) and **grouping independently** (`groupby` on the non-datetime column followed by `resample` on the datetime column) can produce **different outputs**


* The 3 ways it can be done (Multi-level grouping)

 * **1st type:** `df.groupby('column1').resample('10A').agg({'column2':'mean'})`
 <p></p>
 * **2nd type:** `tg = pd.Grouper(freq='10Y')` <br>
`groups = ['column1', tg]`<br>
`df.groupby(groups).agg({'column2':'mean'})`
<p></p>            
 * **3rd type:** For multi-level grouping we can just use `pivot_table`<br>
 `tg = pd.Grouper(freq='10Y')`<br>
 `df.pivot_table(index=tg, columns='column1', values='column2')` # `aggfunc='mean'` by default
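
A sketch of `pd.Grouper` on a datetime column, alone and combined with a second grouping column. The data is made up, and month-start frequency (`'MS'`) is used instead of the multi-year frequencies above so that the tiny example has more than one group.

```python
import pandas as pd

# Hypothetical hiring data
df = pd.DataFrame({
    "hire_date": pd.to_datetime(["2020-01-10", "2020-01-25", "2020-02-03", "2020-02-20"]),
    "dept": ["Police", "Fire", "Police", "Police"],
    "salary": [50.0, 60.0, 70.0, 90.0],
})

# `key` names the datetime column when it is not the index
tg = pd.Grouper(key="hire_date", freq="MS")

by_month = df.groupby(tg).agg({"salary": "mean"})
by_dept_month = df.groupby(["dept", tg]).agg({"salary": "mean"})   # non-datetime column first
pivot = df.pivot_table(index=tg, columns="dept", values="salary", aggfunc="mean")
```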

# 06 Python  DateTime Module
* Converting a datetime into a string with **strftime** | `dateutil`'s `relativedelta` is an upgrade over `timedelta`

**<span  style="color:green; font-size:40px">06. Regular Expression</span>**

# 01-04 Intro to Regular Expression-Character set & Grouping

* **`contains`**: To check whether a match is present or not, use **`contains`**. This gives a boolean series.<p></p> 
    * **Literal character matching**: `series.str.contains('star')` matches if an element contains the string 'star'  
    * To check whether 'x', 'y', or 'z' is present, use `series.str.contains('x') | series.str.contains('y') | series.str.contains('z')`  OR `series.str.contains('[xyz]')`
<p></p>  
<p></p> 
    * **Special characters**: ` . ^ $ * + ? { } [ ] \ | ( )`
        * `.` (dot) matches any character. `series.str.contains('m.le')` will match male, mule, smile etc <p></p> 
        * `^` forces the pattern to match at the beginning of the string. `series.str.contains('^m.le')` will match male, mule but NOT smile<p></p> 
        * `$` forces the pattern to match at the end of the string. `series.str.contains('war$')` will match civil war, not warcraft<p></p> 
        * Combining special characters: `series.str.contains('^s.n')` will match san, sin, son, sanfrancisco etc <p></p> 
        * `*` matches **0 or more**, e.g. `series.str.contains('ah*no')` could be ano, ahno, ahhhhno. <p></p> 
        * `+` matches **1 or more**, e.g. `series.str.contains('ah+no')` could be ahno, ahhhhhno. <p></p> 
        * `?` matches **0 OR 1** time, e.g. `series.str.contains('ah?no')` could be ano, ahno<p></p> 
        * `a{3,}` matches 3 or more a's in a row. `a{,3}` matches 0 to 3 a's in a row. `a{3,5}` matches 3 to 5 a's in a row<p></p> 
        * `|` matches an OR condition. `series.str.contains('Friend|Enemy')` will match any string which contains Friend OR Enemy<p></p> 
        * `[ ]` matches one of the characters inside it. `series.str.contains('T[aei]d')` would match Tad, Ted or Tid <p></p> 
        * `[0-9]` represents any digit | `[a-z]` represents any lowercase character | `[A-Z]` represents any UPPERcase character | `[a-zA-Z]` represents any lower/upper case character.
<p></p>  
    * **Special characters lose their meaning within `[ ]` and after `\`**: 
        * Most of ` . ^ $ * + ? { } [ ] \ | ( )` lose their special meaning within `[ ]`. `[.]` means a literal dot. `[*$]` means a literal * or $ <p></p>
        * ` . ^ $ * + ? { } [ ] \ | ( )` lose their special meaning when preceded by `\`. **`\*`** means a literal *<p></p>
        * `x[^aeiou]`: Here ^ at the start of [ ] negates the set, so it matches any character other than those listed. x must be followed by something other than a, e, i, o, u. Similarly, `x[^A-Za-z]` means x followed by any non-letter character, such as a digit<p></p> 
<p></p>  
<p></p>  
    * **`\d`**: all digits <br>
    **`\D`**: all non-digits <br> 
    **`\s`**: whitespace <br>
    **`\S`**: Non-whitespace <br>
    **`\w`**: word character: a letter (upper or lower case), digit or underscore. Equivalent to `[A-Za-z0-9_]` for ASCII text <br> **`\W`**: any non-word character eg- `^\W+` starts with 1 or more non-word characters<br>
    **`\b`**: word-boundary  eg - `'^(In|My)\b'` will match words starting with In or My. `\b` will ensure In or My are separate words and not part of a bigger word<br>
    **`\B`**: Non-word boundary<br>
    * **`(?:...)`** is a **non-capturing group**. eg- `'^(?:In|My)\b'`. It groups a pattern without capturing it, which matters for `extract`, where only capturing groups are returned
<p></p>  
<p></p>     
*  **`count`**: `series.str.count('T[aei]d')` counts number of times the pattern appears
<p></p>  
<p></p> 
* **`extract`** or **`extractall`**: To extract the pattern use **`extract`** or **`extractall`**<br>
    * **`extract`** must have the part of the pattern we want to extract within `( )` i.e. a capturing group. If part of the pattern has to match but should not be extracted, put it in a non-capturing group `(?:pattern)` <p></p>
    *  `series.str.extract(pattern, expand=False)` returns a series when the pattern has a single capturing group. With `expand=True` (the default), it returns a dataframe
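
A minimal sketch of `contains`, `count` and `extract` with the patterns above. The sample strings are made up for illustration and are not from the course data.

```python
import pandas as pd

# Hypothetical sample strings, made up for illustration
series = pd.Series(['male', 'smile', 'civil war', 'warcraft', 'Ted and Tad', 'In 1999'])

# '.' matches any single character, '^' anchors the match at the start
contains_anywhere = series.str.contains('m.le')    # male and smile
contains_at_start = series.str.contains('^m.le')   # male only

# '$' anchors the match at the end of the string
ends_with_war = series.str.contains('war$')        # civil war, not warcraft

# count: number of times the pattern appears in each string
ted_counts = series.str.count('T[aei]d')

# extract: only the capturing group (\d+) is returned,
# the non-capturing group (?:In|My) must match but is not extracted
years = series.str.extract(r'(?:In|My)\b\s(\d+)', expand=False)
```

With `expand=False` and a single capturing group, `years` is a series that holds `'1999'` for the last string and NaN everywhere else.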

# Case Study- Feature Engineering Titanic

* to **extract first character** of a string series - 
    * `string_series=df['string_col']`<br>
    * `string_series.str[0]` | **Finding length of string** `string_series.str.len()`

**<span  style="color:green; font-size:40px">07. Tidy data </span>**

# 1. Tidy data with `melt`
* Each variable should form a single column, e.g. Male and Female are values of one variable and belong in a single column, not in separate columns. A variable is anything that can vary from one observation to another


* **Melting is done to stack 1 column below another**
<p></p>
* `df_melted = df.melt(id_vars='col_name', value_vars=col_list, var_name='new melted col name', value_name='new col_name')`
    * `id_vars` = the column name(s) that stay unchanged from the original df
    * `value_vars` = the list of columns we want to stack 1 below another
    * `var_name` = name of the new column that holds the old column names
    * `value_name` = name of the new column that holds the values
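
A small sketch of `melt` on a made-up table (the `country`, `male`, `female` columns are hypothetical), where the `male` and `female` columns are stacked into one `sex` column:

```python
import pandas as pd

# Hypothetical messy table: one column per sex
df = pd.DataFrame({
    'country': ['A', 'B'],
    'male': [10, 30],
    'female': [20, 40],
})

df_melted = df.melt(
    id_vars='country',               # stays as it is
    value_vars=['male', 'female'],   # columns stacked 1 below another
    var_name='sex',                  # new column holding the old column names
    value_name='count',              # new column holding the values
)
```

`df_melted` has 4 rows and the columns `country`, `sex`, `count`.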

# 2. Reshape by `pivot` and `pivot_table`
* **`pivot`**: Opposite of `melt` is `pivot`
    * Pivot will put **unique elements of a single column into separate columns**
<p></p>
    * `df_pivot = df.pivot(index='col_name', columns='target_col', values='col')`
    * `index`= column that will not change from original df and it will become new index
    * `columns` = should be the target column whose unique elements needs to be separate columns
    * `values` = the existing column whose values will be redistributed under the new columns (after pivot)
     <p></p>   
    * After pivoting, the index and the columns each carry an extra name label. To remove them (both methods return a new dataframe, so assign the result)-
        * `df_pivot = df_pivot.reset_index()`
        * `df_pivot = df_pivot.rename_axis(None, axis='columns')`
<p></p>

* **`pivot_table`**: if the same `index`/`columns` combination appears in more than one row, `pivot` raises an error and we have to use `pivot_table`, which aggregates the duplicates with `aggfunc`
<p></p>
    * `df_pivot_table = df.pivot_table(index='col_name', columns='target_col', values='col', aggfunc='func')`
    * `values`= the column that has numerical values that will be aggregated
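
A sketch of both reshapes on made-up tidy data (column names are hypothetical). `pivot` is enough while every `country`/`sex` pair is unique, `pivot_table` is needed once a pair repeats:

```python
import pandas as pd

# Hypothetical tidy data
tidy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'sex': ['male', 'female', 'male', 'female'],
    'population': [10, 20, 30, 40],
})

# pivot: unique elements of 'sex' become separate columns,
# then the extra index and column labels are cleaned up
df_pivot = (
    tidy.pivot(index='country', columns='sex', values='population')
        .reset_index()
        .rename_axis(None, axis='columns')
)

# Add a second ('A', 'male') row so that the pair is duplicated
dup = pd.concat([tidy, tidy.iloc[[0]].assign(population=50)], ignore_index=True)

# pivot_table aggregates the duplicated pair, here with max
df_pivot_table = dup.pivot_table(
    index='country', columns='sex', values='population', aggfunc='max'
)
```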

# 3. Common messy datasets
* Various examples of messy datasets and how to clean them
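* If we have to keep multiple columns constant, pass them as a list to `index`. `pivot_table` accepts a list; `pivot` only does so in newer pandas versions (1.1 or later), so on older versions we have to use `pivot_table`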
* `df_pivot_table = df.pivot_table(index='col_name', columns='target_col', values='col', aggfunc='max')` 

# 4. Why Tidy data
* Examples of when Tidy data is better


# Case study: My brothers keeper
* Messy data -> Divide the dataframe into 2 parts -> Melt and clean up each part -> Join the melted dataframes by
`df1_melt.merge(df2_melt, on=['col1', 'col2', 'col3'])`. Here col1, col2, and col3 are the columns common to the 2 dataframes 

**<span  style="color:green; font-size:40px">08. Joining Data </span>**

# 1. Automatic Index Alignment
* Pandas first aligns the data by index (for both series and dataframes) and then carries out operations such as addition or subtraction

* Index labels present in only one of the objects lead to NaN after operations such as addition

* Duplicate index labels lead to a cartesian product of the matching rows (unless the two indexes are identical, in which case they line up one to one)
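
A short sketch of both effects with made-up series. The labels and values are hypothetical:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Alignment happens on the index labels first: 'a' and 'd' exist in only
# one series, so they become NaN in the result
total = s1 + s2

# Duplicate labels: every 'x' in d1 is paired with every 'x' in d2
d1 = pd.Series([1, 2], index=['x', 'x'])
d2 = pd.Series([10, 20, 30], index=['x', 'x', 'y'])
product = d1 + d2   # 2 * 2 = 4 rows for 'x', plus a NaN row for 'y'
```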


# 2. Combining data
* **`pd.concat`** is used to concatenate df on top of each other OR side by side
    * **Concatenating on top of each other**:<br>
    *  `pd.concat([df1, df2], ignore_index=True)`. To discard the original index and get a fresh 0..n-1 index, use `ignore_index=True`
<p></p>
    * `pd.concat([df1, df2], keys=['df1_name', 'df2_name'])`. To have an index level stating which dataframe each row came from, we use `keys`
<p></p>
    * `pd.concat([df1, df2], join='inner')`. By default all columns are kept after concatenating, so columns that are not common to all dataframes are filled with NaN because of automatic column alignment. `join='inner'` keeps only the common columns
<p></p>
    * **Concatenating side by side**:<br>
    *  `pd.concat([df1, df2], axis=1)`
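
The four variants above on two tiny made-up dataframes (column names are hypothetical) that share only column `a`:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [5, 6], 'c': [7, 8]})

# On top of each other with a fresh 0..3 index; 'b' and 'c' get NaN where missing
stacked = pd.concat([df1, df2], ignore_index=True)

# keys adds an outer index level naming the source dataframe
labelled = pd.concat([df1, df2], keys=['df1', 'df2'])

# join='inner' keeps only the common column 'a'
common = pd.concat([df1, df2], join='inner', ignore_index=True)

# Side by side: rows are aligned on the index
side_by_side = pd.concat([df1, df2], axis=1)
```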
    
# 3. SQL databases


# 4. Data Normalization
* Sometimes some column values in a dataset are very repetitive. We can create a dimension (a separate lookup table) from a few columns that are repetitive and related to each other. 

* Next, remove duplicate rows and add a primary key to that dimension

* Now merge this dimension with the original dataframe on the common columns.

* Next, drop these common columns from the original dataframe. The primary key of the dimension (now added to the original dataframe) is enough to look up the common columns in the dimension

* This way we avoid storing repetitive data in the original dataframe
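
The steps above can be sketched as follows. The `orders` table, its columns and the `location_id` key are hypothetical names chosen for illustration:

```python
import pandas as pd

# Hypothetical table with repetitive, related columns (city, country)
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'city': ['Paris', 'Lyon', 'Paris', 'Lyon'],
    'country': ['France', 'France', 'France', 'France'],
    'amount': [10, 20, 30, 40],
})

# Create the dimension, remove duplicate rows, add a primary key
location = orders[['city', 'country']].drop_duplicates().reset_index(drop=True)
location['location_id'] = range(1, len(location) + 1)

# Merge the dimension back on the common columns (left join keeps row order)
orders_norm = orders.merge(location, on=['city', 'country'], how='left')

# Drop the common columns; location_id is enough to look them up
orders_norm = orders_norm.drop(columns=['city', 'country'])
```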

**<span  style="color:green; font-size:40px">09. Visualization </span>**

# 01. Matplotlib fundamentals

* Basics of all controlling parts of a plot view
* `fig, ax = plt.subplots(figsize=(a,b))`. The overall plotting part is `fig`. Graph part is `ax`
    * Use `ax` to further optimize other properties-
        * `ax.set_title('abc')` | `ax.set_xlabel('X axis')` | `ax.set_xlim(0,5)` 
        * `ax.set_xticks([1.5, 3, 4.5])` | `ax.set_xticklabels(['a', 'b', 'c'])`
        * `ax.set_title('abc', size=, color=, backgroundcolor=, fontname=, rotation=)`
        * Different colors https://matplotlib.org/gallery/color/named_colors.html
        * `ax.tick_params()`
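
A minimal sketch that puts these calls together on made-up data. The `Agg` backend line is an assumption so that the snippet also runs without a display; leave it out inside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt

# fig is the overall figure, ax is the graph part
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([0, 1, 2, 3, 4, 5], [0, 1, 4, 9, 16, 25])

ax.set_title('abc', size=14, color='green')
ax.set_xlabel('X axis')
ax.set_xlim(0, 5)

# Tick positions first, then the labels shown at those positions
ax.set_xticks([1.5, 3, 4.5])
ax.set_xticklabels(['a', 'b', 'c'])
ax.tick_params(axis='x', labelsize=12)
```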


# 02. Matplotlib Text and Lines
* Adding text in a figure: `ax.text(x=, y=, s='text', color=, size=, rotation=, backgroundcolor=, fontname=)`
* Set/change position by `text = ax.text(all top parameters)` --> `text.set_position((5,4))`
* Horizontal line: `ax.hlines(y=, xmin=, xmax=)` |  Vertical line: `ax.vlines(x=, ymin=, ymax=)`
* Common line properties: **linewidth, color, linestyle**
* Grid lines- `ax.grid(linestyle='dashed', color='brown', linewidth=3)` | add `axis='x'` if we want grid lines only on the x axis
* **Annotate a point with arrow**:
    * `ax.annotate('text', xy=(2,3), xytext=(4,7), arrowprops={'color':'blue'}, size=15)`
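
A sketch of the text, line, grid and annotation calls on an empty plot. All positions are arbitrary; `hlines`/`vlines` are called here with their documented `colors` and `linestyles` parameter names.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)

# Add text, then move it by keeping a reference to the returned object
text = ax.text(x=1, y=2, s='text', color='red', size=12, rotation=30)
text.set_position((5, 4))

# Horizontal and vertical lines with common line properties
ax.hlines(y=3, xmin=0, xmax=10, linewidth=2, colors='black', linestyles='dashed')
ax.vlines(x=6, ymin=0, ymax=10)

# Grid lines only on the x axis
ax.grid(linestyle='dashed', color='brown', linewidth=3, axis='x')

# Arrow pointing at (2, 3), label placed at (4, 7)
ann = ax.annotate('text', xy=(2, 3), xytext=(4, 7),
                  arrowprops={'color': 'blue'}, size=15)
```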

# 03. Multiple axes figures
* `fig, ax = plt.subplots(2,3,figsize=(a,b))` will generate 2 rows and 3 columns i.e. 6 plots
* Now we can access the 6 plot-specific axes as `ax[0,0]` to `ax[1,2]`
* We can customize each plot i.e. each `ax[i, j]` as in 01. Matplotlib fundamentals 
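
A short sketch of a 2 x 3 grid, where `ax` is a 2D array of axes. The titles are made up:

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 3, figsize=(9, 6))

# Each element of the array is one plot and is customized on its own
for row in range(2):
    for col in range(3):
        ax[row, col].plot([0, 1], [0, row + col])
        ax[row, col].set_title(f'plot {row},{col}')

ax[1, 2].set_xlabel('only on the last plot')
```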

# 04. Matplotlib data plotting
* Several plots are possible using `ax`
    * `ax.plot(x, y, data=, marker='s', linestyle='--')` many other parameters can be used or altered.
    * `ax.plot()` will generate a line plot. We can make `ax.scatter()` | `ax.bar()` | `ax.pie()`
    
* List of colors
* List of markers
* **Univariate analysis**: `ax.boxplot()`


* **Plotting dates**: `ax.plot_date('datetime col', 'numeric_Col', data=)`. Recent Matplotlib versions deprecate `plot_date`; `ax.plot()` handles datetime values directly

# 05. Plotting with Pandas

* Plot from dataframe or series directly with `series.plot()` | `dataframe.plot()`
* `dataframe.plot()`: Pandas plotting is column based, each column is plotted as its own series. *X-axis is the index and y-axis is the column values. Column names become the legend*


* **Change plot types**:
    * `df.plot(kind='hist')` 
    * Other plot types are `line`(default), `bar`, `barh`, `hist`, `box`, `kde`, `area`, `pie`, `scatter`
    
* Additional plotting arguments that could be put inside `df.plot(kind=)` are
    * `linestyle(ls)`, `linewidth(lw)`, `color(c)`, `alpha`, `figsize`, `legend=True`, `title`
    
    
* Overall appearance of the plots: list the styles with `plt.style.available`. Use one with `plt.style.use('ggplot')` - nice appearance


* **Tidy data**: If we have tidy data we can groupby by 1 column and then make `bar` plots
    * If we have multilevel grouping leading to multilevel index then plotting would be ugly. To avoid multilevel index use `df.pivot_table()`
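
A sketch of the tidy-data workflow on a made-up table (the `dept`, `gender`, `salary` columns are hypothetical): group by 1 column for a simple bar plot, and use `pivot_table` instead of a 2-level groupby so that the second grouping becomes columns.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, not needed in a notebook
import pandas as pd

df = pd.DataFrame({
    'dept': ['x', 'x', 'y', 'y'],
    'gender': ['f', 'm', 'f', 'm'],
    'salary': [10, 20, 30, 40],
})

# Group by 1 column: the index (dept) is the x-axis
by_dept = df.groupby('dept')['salary'].mean()
ax1 = by_dept.plot(kind='bar', title='Mean salary by dept')

# 2 grouping columns: pivot_table keeps a single-level index,
# each gender column becomes its own set of bars with a legend entry
wide = df.pivot_table(index='dept', columns='gender', values='salary', aggfunc='mean')
ax2 = wide.plot(kind='bar', figsize=(6, 4))
```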

# 06. Seaborn New
The Seaborn API groups its plots into 5 types. Ted simplified them into 3 types.

* Generic code: `sns.plotting_func(x='col1', y='col2', hue='col3', data=df)`
    * For univariate plot use x or y. hue ( pass a categorical column) adds dimensionality by splitting and coloring data by the categorical column elements
    <p></p> 
    <p></p>
#### Seaborn divides plotting into following groups: seaborn.pydata.org/api.html

* `Relational` | `Categorical` | `Distribution` | `Regression` | `Matrix`
<p></p>
* **Relational**: `sns.relplot()` with `kind='scatter'` (default) or `kind='line'`
<p></p>
* **Categorical**: `sns.catplot()` <br>
    * **Categorical scatterplot**: with `kind='strip'` (default, stripplot), `kind='swarm'` (swarmplot)<br>
    * **Categorical distribution plot**: with `kind='box'` (boxplot), `kind='violin'` (violinplot), `kind='boxen'` (boxenplot)<br>
    * **Categorical estimate plot**: with `kind='point'` (pointplot), `kind='bar'` (barplot), `kind='count'` (countplot)
<p></p>
* **Distribution**: `sns.jointplot()` | `sns.pairplot()` | `sns.distplot()` | `sns.kdeplot()`
<p></p>
* **Regression**: `sns.lmplot()` | `sns.regplot()` | `sns.residplot()` 
<p></p>
* **Matrix**: `sns.heatmap()` | `sns.clustermap()`  

#### Ted divided plotting into-
1. **Distribution**: For continuous variable `box`, `violin`, `hist`, `kde` <br>
2. **Grouping & aggregating**: Group by a categorical variable and aggregate a continuous variable `bar`, `count`, `point` <br>
3. **Raw data**: `scatter`, `line`, heatmaps 
 
 
* **Distribution plot**:
    * Common: `boxplot`, `violinplot`, `distplot`
    * Others: `stripplot`, `swarmplot`, `boxenplot`, `jointplot`
    * **Univariate distribution plot**: 
        * `sns.boxplot()` | `sns.violinplot()` | `sns.distplot(df['col'], hist=False)`-kde or hist
    * **MULTIvariate distribution plot**: 
        * `sns.boxplot(x=continuous_col, y=categorical_col, data=, ax=ax)`. Now use this `ax` and 01. matplotlib fundamentals to optimize the plot further
        * `sns.violinplot(x=continuous_col, y=categorical_col, data=, ax=ax)`
        * `sns.jointplot(x=continuous_col, y=continuous_col, data=)`. `jointplot` builds its own figure, so it takes no `ax`
        * Another dimension can be added by using `hue`
      
      
* **Grouping & aggregating**:
    * **Univariate plot**: `sns.countplot(y=categorical_col, data=)`. `hue` can be used too
    * **Multivariate plot- Grouping by a categorical**:
        * `sns.pointplot(x=categorical_col, y=continuous_col, data=, estimator=np.mean, ci=None)`. `estimator` can be any other aggregation function; `ci` is the confidence interval (`ci=None` hides it). Another dimension can be added by `hue` <br>
        * `sns.barplot(x=categorical_col, y=continuous_col, data=, estimator=np.mean, ci=None)`. Same `estimator`, `ci` and `hue` options as `pointplot`
        
        
        
* **Raw plots**: Do not change the data. Plot the raw data as it is.
    * `scatterplot` | `lineplot` | `regplot` or `lmplot` | `heatmap` or `clustermap`
    * `sns.scatterplot(x=continuous_col, y=continuous_col, data=, hue=categorical_col, style=categorical_col)`
    * `sns.regplot(x=continuous_col, y=continuous_col, data=)`
    * `sns.lmplot(x=continuous_col, y=continuous_col, hue=, data=, lowess=True)` lowess=True allows locally weighted regression
    
    * **Heatmaps**: Mostly used to see correlation between different variables in the data
        * `sns.heatmap(df.corr(), ax=ax)`
        * `sns.clustermap(df.corr())`. `clustermap` builds its own figure, so it takes no `ax`

* **Grid plots**: Multiple plots can be created by `sns.relplot()` | `sns.catplot()` | `sns.lmplot()`. These 3 do not create any new type of plot, they just allow a grid of multiple plots <br>
    * `sns.catplot(x=continuous_col, data=, kind=, col=categorical_col, col_wrap=3)`. `col_wrap=3` allows a maximum of 3 columns in each row
    *  `sns.catplot(x=categorical_col, y=continuous_col, hue=categorical_col, data=, kind='bar', row=categorical_col, col=categorical_col, ci=None)`. `col_wrap` cannot be combined with `row`
    

* The same grid plots with `row`, `col` can be made with `relplot()` and `lmplot()`. However, here x and y should be continuous
    * `sns.relplot(x=continuous_col, y=continuous_col, hue=categorical_col, data=, kind='scatter', row=categorical_col, col=categorical_col)`    
    * `sns.lmplot(x=continuous_col, y=continuous_col, hue=categorical_col, data=, row=categorical_col, col=categorical_col)` No `kind` here
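
A sketch of the difference between axes-level functions (which accept `ax`) and grid functions (which build their own figure). The data is randomly generated and the column names are hypothetical:

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data: 1 categorical and 2 continuous columns
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'group': np.repeat(['a', 'b', 'c', 'd'], 25),
    'x': rng.normal(size=100),
})
df['y'] = 2 * df['x'] + rng.normal(size=100)

# Axes-level plot: draw on our own ax, then customize it with matplotlib
fig, ax = plt.subplots(figsize=(6, 4))
sns.boxplot(x='x', y='group', data=df, ax=ax)
ax.set_title('x by group')

# Heatmap of correlations, also axes-level
fig2, ax2 = plt.subplots()
sns.heatmap(df[['x', 'y']].corr(), annot=True, ax=ax2)

# Grid plot: one box plot per group, at most 2 plots per row
grid = sns.catplot(x='x', data=df, kind='box', col='group', col_wrap=2)
```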

# 07 Dexplot
`import dexplot as dxp`
* `aggplot` | `jointplot` | `heatmap`


* **`aggplot`**: similar to `catplot`
    * `dxp.aggplot(agg='continuous_col', groupby='categorical_col', data=, hue='categorical_col')` Other parameters are `figsize=(12,8)` and make **stacked bar plot** by adding `stacked=True`
    * By default `aggfunc='mean'` we can change that by passing other func `aggfunc='max'` / 'median'
    * `orient='h'` horizontal orientation
    *  `dxp.aggplot(agg='categorical_col', data=)` will plot a **countplot** or value count plot. We can add `hue='categorical_col'`. And `normalize='all'` or any of the `agg` or `hue` col
    * we can also plot other than `bar` plot: `line` | `box` | `hist`
        * `dxp.aggplot(agg='continuous', groupby='categorical', data=, hue='categorical', kind='line', aggfunc='median')` `orient='h'` can be an add on | `kind='box'` 
        * `dxp.aggplot(agg='continuous', groupby='categorical', data=, kind='hist', orient='v')` OR `kind='kde'`
        * Use both `rows` and `cols`:
            * `dxp.aggplot(agg='continuous', groupby='categorical', data=, kind='hist', row='categorical', col='categorical')` OR we can use `normalize` with `kind='bar'`
            
            
* **`jointplot`**: plots raw data. No aggregation of data in this
    * `dxp.jointplot(continuous1, continuous2, data=, hue=categorical1, row=categorical2, col=categorical3)`
    * Add `fit_reg=True` to add a regression line
    * Default `kind=scatter`. We can also use `kind=line` eg- `dxp.jointplot(date, continuous2, data=, hue=categorical1, row=categorical2, col=categorical3, kind='line')`
    

            
* **`heatmap`**: `dxp.heatmap(categorical1, categorical2, data=, annot=True)`
    * **Aggregate and heatmap**: `dxp.heatmap(x=categorical1, y=categorical2, agg=continuous1, aggfunc='max', data=, annot=True)`
        * We can also **normalize** by either x or y i.e. a categorical column `normalize=categorical1`

**<span  style="color:green; font-size:40px">10. EDA </span>**

# 01. Data Taxonomy
* variables
    * Continuous
    * Categorical
        * Ordinal
        * Nominal
        
        
        
* Convert a feature/variable into categorical: 
    * `cat = pd.Categorical(df['col'])` |  `cat = df['col'].astype('category')` 
    * If we want an ordered category: 
        * Define the order of the elements in `'col'`, eg- `order=['fair', 'good', 'very good']`
        * `cat = pd.Categorical(df['col'], categories=order, ordered=True)`
        * **Binning a numerical feature leads to ordered category**
            * `pd.cut(df['col'], 5)` | `pd.cut(df['col'], [0,2,4,6])`
            * `pd.qcut(df['col'], 5)` quantile cut leads to bins such that each contains roughly the same no. of observations
    * Converting a categorical feature to the category dtype (rather than keeping it as object dtype) saves a lot of space when the column has few unique elements. The category dtype maps each unique element to an integer and stores that integer every time the element occurs, whereas an object column stores the full string every time.
    * We can call `cat.describe()`, which gives counts and frequencies
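
A sketch of the conversion and of the memory saving on a made-up column with 3 unique elements repeated many times. The `quality` column name is hypothetical:

```python
import pandas as pd

# Hypothetical column with few unique elements and many rows
df = pd.DataFrame({'quality': ['good', 'fair', 'very good', 'good'] * 1000})

# Ordered categorical: fair < good < very good
order = ['fair', 'good', 'very good']
cat = pd.Categorical(df['quality'], categories=order, ordered=True)
df['quality_cat'] = cat

# Each element is stored as a small integer code pointing into the categories
first_codes = list(cat.codes[:4])

# Compare the memory used by the string column and the categorical column
string_bytes = df['quality'].memory_usage(deep=True)
category_bytes = df['quality_cat'].memory_usage(deep=True)

# Counts and frequencies per category
summary = cat.describe()
```

Because the category is ordered, comparisons such as `df['quality_cat'] > 'fair'` and `min()`/`max()` follow the order given in `order`.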


# 02. Data Dictionary

* Creating data dictionary with col_name, col_description, df_dtypes, col_nunique

* Label each column as continuous, ordinal, nominal

* Rearrange column order: 
    * First, the string column that describes the row 
    * Left to right should be more important to less important
    * Group similar columns together
    
* Add missing values per column in data dictionary

**<span  style="color:RED; font-size:22px"> Step-wise progress of how to systematically analyze data up to EDA </span>**

# 03. Non-Graphical Univariate Analysis

* Steps after one gets a dataset
    * Read in data and data-dictionary
    * Tidy the data
    * Update data dictionary with `dtype` | `nunique` | `[continuous, ordinal, nominal]` | `missing values`
    * Rearrange data column orders



| Univariate             | Graphical                               | Non-Graphical                     | 
|-------------|-----------------------------------------|-----------------------------------|
| Categorical | Bar chart of frequencies (count/percent) | `value_counts` (count/percent) |
| Continuous  | Histogram/KDE, box/violin  | central tendency -mean/median/mode, variance, std, skew, IQR  |

| Multivariate            | Graphical                               | Non-Graphical                     | 
|-------------|-----------------------------------------|-----------------------------------|
| Categorical vs Categorical | heat map, mosaic plot | Cross tabulation (count/percent) |
| Continuous vs Continuous  | all pairwise scatterplots, kde, heatmaps |  all pairwise correlation/regression   |
| Categorical vs Continuous  | All seaborn "categorical" plots | Summary statistics for each group |

* **Non-GRAPHICAL Univariate analysis for continuous variables:**
    * `df.describe()` will include the central tendencies
    
    
* **Non-GRAPHICAL Univariate analysis for categorical variables:**
    * Ordinal: `df['col'].value_counts()` and then convert them into categorical dtype by
        * `order=['fair', 'good', 'very good']`
        * `cat = pd.Categorical(df['col'], categories=order, ordered=True)`
    * Nominal:  `df['col'].value_counts()` we can convert them in categorical dtype but without order

# 04. Graphical Univariate Analysis

* **GRAPHICAL Univariate analysis for continuous variables:**
    * `df.describe()` 
    * **Distribution** : We can plot a histogram or a kde plot separately (e.g. `df.plot(kind='hist')` or `kind='kde'`). Seaborn plots them together with `sns.distplot(df['numeric_col'])`; show only one of them with `kde=False` or `hist=False`. We can also use a **violin plot**
    * **Outliers**: Check by boxplot `sns.boxplot(x=df['numeric_col'])` 
    * Consider binning continuous variable into a categorical column



* **GRAPHICAL Univariate analysis for categorical variables:**
    * `value_counts()` and plot
    * `sns.countplot()`

# 05. Outliers and Duplicates
**Numerical columns**
* Follow Soledad's way of finding Outliers and how to find out upper and lower boundary
* Follow Ted's way of making a boolean series of which rows contain an outlier 
    * explore outlier rows to check for a trend
* Plot entire df boxplot to visually see outliers: 
    * `df.plot(kind='box', subplots=True, figsize=(8,6), layout=(2,4))`
* Once found explore the outlier rows and see if there is something abnormal
* We can capture **Outlierness** by creating a 1/0 column

**Categorical columns**
* Look into low frequency elements in a category. Aggregate all low frequency elements into one `'others'` to reduce cardinality ~ Soledad
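
The notes do not spell out how the boundaries are found for numerical columns. As an assumption, the sketch below uses the common IQR rule (1.5 x IQR beyond the quartiles) to build the boolean series and the 1/0 column; the `price` data is made up.

```python
import pandas as pd

# Hypothetical numerical column with one extreme value
df = pd.DataFrame({'price': [10, 12, 11, 13, 12, 11, 100]})

# Assumed rule: boundaries at 1.5 * IQR beyond the quartiles
q1 = df['price'].quantile(0.25)
q3 = df['price'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean series marking the rows that contain an outlier
is_outlier = (df['price'] < lower) | (df['price'] > upper)

# Explore the outlier rows, then capture outlierness as a 1/0 column
outlier_rows = df[is_outlier]
df['price_outlier'] = is_outlier.astype(int)
```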

# 06. Duplicates
* **Finding duplicate rows**
    * `filt=df.duplicated(keep=False)` then `df[filt]` to see duplicate rows
    * Drop duplicate rows by `df.drop_duplicates()`
    
* **Finding duplicate columns**
    * `filt=df.T.duplicated(keep=False)` then `df.T[filt]` to see duplicate columns
    * Follow Soledad's method to remove constant and quasi-constant values
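
A sketch of both checks on a tiny made-up dataframe where rows 0 and 2 are identical and `a_copy` duplicates `a`:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 1],
    'b': ['x', 'y', 'x'],
    'a_copy': [1, 2, 1],
})

# Duplicate rows: keep=False marks every copy, not only the later ones
row_filt = df.duplicated(keep=False)
dup_rows = df[row_filt]
deduped = df.drop_duplicates()

# Duplicate columns: transpose so that columns become rows, then repeat
col_filt = df.T.duplicated(keep=False)
dup_cols = df.T[col_filt]
```

Transposing a large dataframe can be slow and memory hungry, so this column check is best suited to small or medium data.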

# 07. Multivariate Categorical vs Categorical

* **Non-GRAPHICAL categorical vs categorical:**
    * `df.pivot_table(index='categorical_col1', columns='categorical_col2', aggfunc='size')`
    
    
* **GRAPHICAL categorical vs categorical:**
    * `sns.countplot(x='categorical_col1', hue='categorical_col2', data=)`
    * This is same as `df.pivot_table(index='categorical_col1', columns='categorical_col2', aggfunc='size').plot(kind='bar')`
    * Or we can do heatmaps
        * `cat_abc=df.pivot_table(index='categorical_col1', columns='categorical_col2', aggfunc='size')`
        * `sns.heatmap(cat_abc)`
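
A sketch of the count table on made-up data (the `sex` and `survived` columns are hypothetical). `pd.crosstab`, which is not mentioned above, is shown next to it as an alternative that gives the same counts here:

```python
import pandas as pd

df = pd.DataFrame({
    'sex': ['m', 'f', 'f', 'm', 'f'],
    'survived': ['yes', 'yes', 'no', 'no', 'yes'],
})

# Count of rows for each combination of the 2 categorical columns
counts = df.pivot_table(index='sex', columns='survived', aggfunc='size')

# Cross tabulation gives the same counts
same_counts = pd.crosstab(df['sex'], df['survived'])
```

Either table can be passed to `.plot(kind='bar')` or `sns.heatmap()`.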



# 08. Multivariate Categorical vs Continuous
* Mean value of the continuous variable w.r.t each categorical element
    * `sns.barplot(x='categorical_col', y='continuous_col', data=)`
    * Add 1 more dimension (categorical) by `hue` or `row` or `col`
        * `sns.catplot(x='categorical_1', y='continuous_col', data=, kind='bar', row='categorical_2', col='categorical_3')`
    * Heatmap-2 categorical and 1 continuous column
        * `pt = df.pivot_table(index='categorical_1', columns='categorical_2', values='continuous')`
        * `sns.heatmap(pt)`

# 09. Multivariate Continuous vs Continuous
* Pairwise scatter plot
* clustermap to find cluster of similar features in the data
    * `sns.clustermap(df.corr())`
* When both variables are continuous, the go-to plot is a scatter plot
    * Also, `sns.lmplot(x='continuous_col', y='continuous_col', data=)`
    * `sns.lmplot(x='continuous_col', y='continuous_col', data=, hue='categorical_1', fit_reg=False)`


# 10. Binning a continuous variable

* **Binning a numerical feature leads to ordered category**
    * `pd.cut(df['col'], 5)` | `pd.cut(df['col'], [0,2,4,6])`
    * `pd.qcut(df['col'], 5)` quantile cut leads to bins such that each contains roughly the same no. of observations
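
A sketch of the 3 ways of binning on a made-up column. The values are chosen so that the bins are easy to check by hand:

```python
import pandas as pd

df = pd.DataFrame({'col': [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]})

# 5 bins of equal width between the min and the max
equal_width = pd.cut(df['col'], 5)

# Custom bin edges: (0, 2], (2, 4], (4, 6]
custom_edges = pd.cut(df['col'], [0, 2, 4, 6])

# Quantile cut: 3 bins with the same number of observations
equal_count = pd.qcut(df['col'], 3)
```

All 3 results are ordered categorical series, so the bins can be sorted and compared.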