<a href="https://colab.research.google.com/github/MonkeyWrenchGang/MGTPython/blob/main/module_2/2_introduction_to_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Pandas 
---

Pandas is a Python library for data manipulation and analysis. It provides easy-to-use data structures and analysis tools for handling large amounts of data. You can think of its core structure, the DataFrame, as a spreadsheet with rows and columns that you manipulate and analyze using Python.


In this tutorial we'll cover the following Pandas basics:

1. Reading CSV & Excel Files 
2. Pandas Properties 
3. Data Types 
    - converting data types 
4. Filtering Columns 
5. Filtering Rows 
6. Filtering Rows using Queries
7. Filtering both Rows and Columns



## Load Libraries

In [None]:
# -- basic stuff for your notebook -- 
from IPython.display import display, HTML
from IPython.display import clear_output
display(HTML("<style>.container { width:90% }</style>"))
import warnings
warnings.filterwarnings('ignore')
# ------------------------------------------------------------------

# -- core packages --
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# -- need this to render charts in notebook -- 
%matplotlib inline

## 1. Reading CSV & Excel Files 

Here we read the following CSV and Excel files into data frames and **eyeball** the data:

- aapl.csv    : a comma-delimited file of daily prices for Apple
- msft.xlsx   : an Excel file of daily prices for Microsoft
- churn.csv   : a comma-delimited file of telco churn
- titanic.csv : a comma-delimited file of Titanic survivors and non-survivors

To do this we'll use the `dataframe = pd.read_*("file location")` template, then check our data with `dataframe.head()`.

```python
aapl = pd.read_csv("data/aapl.csv")
aapl.head()

msft = pd.read_excel("data/msft.xlsx")
msft.head()

churn = pd.read_csv("data/churn.csv")
churn.head()

titanic = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
titanic.head()

```
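
If the data files are not at hand, the same read-then-eyeball template can be tried on an in-memory CSV. This is a minimal sketch: the `csv_text` string and its values are made up for illustration and only stand in for a file such as `data/aapl.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """Date,Close,Volume
2020-01-02,10.0,100
2020-01-03,10.5,200
2020-01-06,11.0,300
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
prices = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows; with no argument it returns up to 5 rows.
first_two = prices.head(2)
print(first_two)
```

Passing a number to `head()` is handy when five rows are too many or too few to get a feel for the data.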

## 2. Pandas Properties 

https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

- index: returns the index (row labels) of the data frame, which pandas uses for indexing and alignment.
- shape: returns a tuple with the number of rows and columns of the data frame.
- columns: returns the column names of the data frame.
- dtypes: returns the data type of each column.
- info(): prints a summary of the data frame, including each column's name, non-null count, and data type.

The template is `dataframe.<property or method>`.

```python
# -- example calls -- 
aapl.index
aapl.shape
aapl.columns 
aapl.dtypes 
aapl.info()

msft.index
msft.shape
msft.columns
msft.info()
```
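
To see what each property returns without loading a file, here is a small sketch on a toy data frame. The `toy` frame and its values are invented for illustration.

```python
import pandas as pd

# A made-up three-row data frame, used only to show the properties.
toy = pd.DataFrame(
    {
        "ticker": ["AAA", "BBB", "CCC"],
        "close": [10.0, 20.5, 30.25],
        "volume": [100, 200, 300],
    }
)

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

print(toy.index)          # the row labels
print(list(toy.columns))  # the column names as a plain list
print(toy.dtypes)         # one data type per column

# info() prints its summary and returns None, so there is nothing to assign.
toy.info()
```

Note that `shape`, `index`, `columns`, and `dtypes` are properties (no parentheses), while `info()` is a method.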

## 3. Data Types aka dtypes 

The dtypes property shows the data types that pandas inferred from the data. If pandas can't infer a more specific type for a column, it stores the column as the object data type, which in practice means treating the values as strings. A value like $100.10 will therefore usually be imported as an object rather than a float, and converting it requires some cleaning.

Compare the dtypes of **msft** and **aapl**. You'll notice that the data types pandas infers can differ. Why does `msft`'s **Date** have `datetime64[ns]` while `aapl`'s has **object**?

```python
msft.dtypes
aapl.dtypes
```

https://pbpython.com/pandas_dtypes.html
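
The currency case mentioned above can be sketched as follows. The `csv_text` data is made up, and stripping the `$` and `,` characters is one reasonable cleaning approach, not the only one.

```python
import io

import pandas as pd

# Hypothetical CSV with prices written as currency strings.
csv_text = """item,price
widget,$100.10
gadget,"$1,250.00"
"""

raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)  # price is read as text, not as a number

# Remove the dollar sign and thousands separator, then convert to float.
cleaned = raw.copy()
cleaned["price"] = (
    raw["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)
print(cleaned.dtypes)
```

Using `regex=False` matters here because `$` has a special meaning in regular expressions.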



### Converting data types 

How can we fix aapl's Date? The simplest approach is to get it right on import. To do this, we tell pandas which column to parse as a date with the `parse_dates` argument.

```python

# -- import with the parse_dates argument -- 
aapl = pd.read_csv("data/aapl.csv", parse_dates=['Date'])
aapl.dtypes

```

Alternatively, we can assign the datatypes after the fact. 

```python

# -- import then convert -- 
aapl = pd.read_csv("data/aapl.csv")
aapl["Date"] = pd.to_datetime(aapl["Date"])
aapl.dtypes

```
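
Both conversion routes can be compared side by side on an in-memory CSV. This is a sketch with made-up data; the `csv_text` string stands in for `data/aapl.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content with an ISO-formatted date column.
csv_text = """Date,Close
2020-01-02,10.0
2020-01-03,10.5
"""

# Route 0: plain import, the Date column stays as text.
as_text = pd.read_csv(io.StringIO(csv_text))

# Route 1: parse the column while importing.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Route 2: import first, convert afterwards.
converted = as_text.copy()
converted["Date"] = pd.to_datetime(converted["Date"])

print(as_text["Date"].dtype)
print(parsed["Date"].dtype)

# Once the column is a datetime, the .dt accessor exposes date parts.
print(parsed["Date"].dt.year.tolist())
```

The payoff of a real datetime column is the `.dt` accessor, which is not available on a text column.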



## 4. Filtering Columns 

Bracket bracket! To filter or subset the columns in a dataframe we use `df[[<columns separated by commas>]]`. Here, suppose we want to subset the titanic and churn data sets:

```python
# -- PassengerId, Survived, Name, Age -- 
titanic_res1 = titanic[["PassengerId", "Survived", "Name", "Age"]]
titanic_res1.head()

# -- same thing just filter from a list -- 
column_list = ["PassengerId", "Survived", "Name", "Age"]
titanic_res2 = titanic[column_list]
titanic_res2.head()


# -- State, Account Length, Area Code, Phone -- 
churn_res1 = churn[["State", "Account Length", "Area Code", "Phone"]]
churn_res1.head()


```
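
Why two brackets? A short sketch on a toy frame (values invented for illustration) shows that a single bracket with one column name returns a Series, while a list of names inside the brackets returns a DataFrame.

```python
import pandas as pd

# A made-up frame with a few titanic-style columns.
toy = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3],
        "Name": ["A", "B", "C"],
        "Age": [22.0, 38.0, 26.0],
    }
)

one_col_series = toy["Age"]      # single bracket: a Series
one_col_frame = toy[["Age"]]     # bracket bracket: a one-column DataFrame
two_cols = toy[["Name", "Age"]]  # columns come back in the order requested

print(type(one_col_series).__name__)
print(type(one_col_frame).__name__)
print(list(two_cols.columns))
```

The inner brackets are just a Python list, which is why passing `column_list` works the same way.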




## 5. Filtering Rows

Pandas supports the common comparison operators:
- equal to: `df[df["column"] == "some_value"]`
- not equal to: `df[df["column"] != "some_value"]`
- greater than: `df[df["column"] > some_number]`  **( ==, !=, >, >=, <, <=)**
- in a list: `df[df["column"].isin(["a list", "separated by comma"])]`
- not in a list, use the tilde **`~`**: `df[~df["column"].isin(a_list)]`
- find nulls, use **.isna()**: `df[df["column"].isna()]`
- find not null, use **.notnull()**: `df[df["column"].notnull()]`

**Multiple Conditions (& |)**
- ampersand **&** operator for AND-ing
- pipe **|** operator for OR-ing
- parentheses around each condition, because `&` and `|` bind more tightly than comparison operators such as `==` and `>`

Here is an example of AND and OR:

- AND  : `df[(df["column1"] != "value") & (df["column2"] > some_nbr)]`
- OR   : `df[(df["column1"] != "value") | (df["column2"] > some_nbr)]`
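
The AND and OR patterns can be tried on a toy frame before moving to the real data. This is a minimal sketch; the `toy` frame and its values are made up.

```python
import pandas as pd

# A made-up frame with two columns to combine conditions on.
toy = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male", "male"],
        "Age": [22, 38, 15, 54, 10],
    }
)

# Each comparison produces a boolean Series (a mask), one value per row.
is_female = toy["Sex"] == "female"
over_20 = toy["Age"] > 20

# AND keeps rows where both masks are True; OR keeps rows where either is.
both = toy[(toy["Sex"] == "female") & (toy["Age"] > 20)]
either = toy[(toy["Sex"] == "female") | (toy["Age"] > 20)]

print(both)
print(either)
```

Naming the masks first, as with `is_female` and `over_20`, is an optional style that can make long filters easier to read: `toy[is_female & over_20]` gives the same rows as `both`.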

Let's build some results by selecting titanic passengers who meet these criteria:

1. passengers older than 50
2. passengers younger than 40 in Pclass 1
3. passengers who are male and older than 20
4. passengers who **did not** embark at C or Q and are older than 30
5. passengers who are female and younger than 20

```python
# res1
res1 = titanic[titanic["Age"] > 50]
res1["Survived"].value_counts()

# res2
res2 = titanic[(titanic["Age"] < 40) & (titanic["Pclass"] == 1)]
res2["Survived"].value_counts()

# res3 
res3 = titanic[(titanic["Sex"] == "male") & (titanic["Age"] > 20)]
res3["Survived"].value_counts()

# res4 
res4 = titanic[(~titanic["Embarked"].isin(["C", "Q"])) & (titanic["Age"] > 30)] 
res4["Survived"].value_counts()

# res5 
res5 = titanic[(titanic["Sex"] == "female") & (titanic["Age"] < 20)]
res5["Survived"].value_counts()

```
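
One thing to watch for when filtering on a column with missing values, such as Age: a comparison against a missing value is False, so those rows drop out of both the "greater than" and the "less than" result. The sketch below uses a made-up `toy` frame to show this, along with `isin`, `~`, and `isna`.

```python
import numpy as np
import pandas as pd

# A made-up frame with one missing Embarked value and one missing Age.
toy = pd.DataFrame(
    {
        "Embarked": ["S", "C", "Q", "S", None],
        "Age": [35.0, np.nan, 40.0, 12.0, 60.0],
    }
)

# ~ flips the isin mask; a missing Embarked is not in the list, so it is kept.
not_c_or_q = toy[~toy["Embarked"].isin(["C", "Q"])]

# Rows where Age is missing.
missing_age = toy[toy["Age"].isna()]

# The row with a missing Age appears in neither of these two results.
older = toy[toy["Age"] > 30]
not_older = toy[toy["Age"] <= 30]

print(len(toy), len(older) + len(not_older))
```

If the counts from two complementary filters do not add up to the number of rows, missing values are the usual reason.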


In [None]:
# < insert res 1 here >

## 6. Filtering Rows using Queries

An alternative to filtering rows with boolean conditions on the dataframe is `DataFrame.query()`. The `.query()` method is meant to simplify conditional and multi-conditional filtering logic, but its syntax differs slightly from bracket-based filtering. For those familiar with SQL it may be more intuitive.

Query supports the common comparison operators, but now you are writing an EXPRESSION as a string:

`df.query('expression1 and expression2')` Notice that single quotes surround the whole expression!

- equal to: `df.query('column == "some_value"')`
- not equal to: `df.query('column != "some_value"')`
- greater than: `df.query('column > some_value')`  ( ==, !=, >, >=, <, <=)
- in a list: `df.query('column == ["list","value"]')`
- not in a list: `df.query('column != ["list","value"]')`
- find nulls: use `df.query('column.isnull()', engine='python')`; note the `engine='python'` argument.
    - common alternative: `df.query('column != column', engine='python')`
- find not null: use `df.query('column.notnull()', engine='python')`; note the `engine='python'` argument.
    - common alternative: `df.query('column == column', engine='python')`

With query you can connect multiple conditions with the words **and** and **or**:

`df.query('column != "value1" and column2 > 50', engine='python' )`
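
To check that a query expression and a bracket-based filter select the same rows, the two can be run side by side. This is a minimal sketch on a made-up `toy` frame.

```python
import numpy as np
import pandas as pd

# A made-up frame with a text column and a numeric column with one gap.
toy = pd.DataFrame(
    {
        "State": ["VA", "TX", "NY", "VA"],
        "Mins": [120.0, 80.0, np.nan, 95.0],
    }
)

# Equality, list membership combined with "and", and a null check.
q_eq = toy.query('State == "VA"')
q_in = toy.query('State == ["VA", "TX"] and Mins > 90')
q_null = toy.query('Mins.isnull()', engine='python')

# The bracket-based equivalent of q_in.
mask_in = toy[toy["State"].isin(["VA", "TX"]) & (toy["Mins"] > 90)]

print(q_in)
print(q_in.equals(mask_in))
```

Which style to use is mostly a matter of taste; the query string tends to be shorter once several conditions are involved.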

Let's use query to answer the following questions about churn:

1. State equals VA
2. State in VA, TX, TN and Account Length > 100 
3. Day Mins > 100 and Intl Calls > 2 
4. VMail Plan == yes and Int'l Plan == no
5. (State in VA, TX, TN and Night Mins > 200) or Account Length > 100 

You'll notice a challenge with the poorly formatted column names: you have to wrap names that contain spaces or start with a number in backticks. In query 4 below, the backslash in `Int\'l Plan` escapes the single quote inside the single-quoted Python string.


```python
# query 1 
q1 = churn.query('State == "VA"')
q1["Churn?"].value_counts()

# query 2
q2 = churn.query('State == ["VA", "TX", "TN"] and `Account Length` > 100')
q2["Churn?"].value_counts()
                 
# query 3
q3 = churn.query('`Day Mins` > 100 and `Intl Calls` > 2 ')
q3["Churn?"].value_counts()         
                 
# query 4
q4 = churn.query('`VMail Plan` == "yes" and `Int\'l Plan` == "no" ')
q4["Churn?"].value_counts() 

# query 5
q5 = churn.query('(State == ["VA", "TX", "TN"] and `Night Mins` > 200) or `Account Length` > 100')
q5["Churn?"].value_counts()
                 
```
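
One way to sidestep backticks altogether is to rename the columns once, up front. The sketch below is an optional alternative that the original workflow does not use; the `toy` frame and the snake_case naming rule are assumptions made for illustration.

```python
import pandas as pd

# A made-up frame with churn-style column names containing spaces.
toy = pd.DataFrame(
    {
        "State": ["VA", "TX", "VA"],
        "Account Length": [120, 80, 95],
        "Day Mins": [200.5, 150.0, 99.0],
    }
)

# With the original names, backticks are needed around "Account Length".
with_backticks = toy.query('State == "VA" and `Account Length` > 100')

# rename accepts a function that is applied to every column name.
renamed = toy.rename(columns=lambda name: name.strip().lower().replace(" ", "_"))

# After renaming, the same filter needs no backticks.
without_backticks = renamed.query('state == "VA" and account_length > 100')

print(list(renamed.columns))
print(without_backticks)
```

The trade-off is that the renamed columns no longer match the headers in the source file, so pick one convention and stick with it.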



## 7. How can I filter both Rows and Columns?

Now that you know how to filter rows and how to filter columns individually, you can put them together like this:

Dataframe-based filtering is as simple as adding a *bracket bracket* after the row filter.

```python
rowcol1 = titanic[titanic["Age"] > 50][["PassengerId", 
                                        "Survived", 
                                        "Name", 
                                        "Age"]]
print(rowcol1["Survived"].value_counts())
rowcol1

```

Query based filtering is as simple as adding a *bracket bracket* after the query as well. 

```python
rowcol2 = churn.query('State == ["VA", "TX", "TN"] \
                       and `Account Length` > 100')[["State", 
                                                     "Account Length",
                                                     "Churn?",
                                                     "Area Code", 
                                                     "Phone"]]
print(rowcol2["Churn?"].value_counts())

rowcol2

```
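
Pandas also offers `.loc`, which takes the row condition and the column list in a single step. It is not used elsewhere in this notebook, so treat the sketch below as an optional alternative; the `toy` frame is made up for illustration.

```python
import pandas as pd

# A made-up frame with titanic-style columns.
toy = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3, 4],
        "Survived": [1, 0, 1, 0],
        "Name": ["A", "B", "C", "D"],
        "Age": [22.0, 54.0, 61.0, 35.0],
    }
)

wanted = ["PassengerId", "Survived", "Age"]

# Row filter followed by bracket bracket, as shown above.
chained = toy[toy["Age"] > 50][wanted]

# The same selection with .loc[rows, columns].
with_loc = toy.loc[toy["Age"] > 50, wanted]

print(with_loc)
print(chained.equals(with_loc))
```

Both forms return the same rows and columns here; `.loc` is also the form to prefer if the selection is later used to assign values back into the frame.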
