# Module 4 - Data Structure and Pandas Dataframe in Python

* **Instructor**: Ronnie (Saerom) Lee 
* **E-mail**: saerom@umich.edu (for any questions)
* **Date**: May 15th (Tuesday), 2018

## \*** Import relevant packages
> **Q. How do you import packages?**
* **import**
    - Call a package of useful tools and functions
* **import** (package_name) **as** (abbreviation) 
    - Call a package and abbreviate its name
* **from** (package_name) **import** (class or function)
    - Call a specific class or function in the package
* **from** (package_name) **import** (class or function) **as** (abbreviation)
    - Call a specific class or function from a package and abbreviate its name
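
> *Note*. A minimal sketch of the four forms above, using the standard-library `math` module purely for illustration. The abbreviations `m` and `square_root` are made up for this example.

```python
# (1) Import the whole package
import math
# (2) Import the package under an abbreviation
import math as m
# (3) Import one function from the package
from math import sqrt
# (4) Import one function under an abbreviation
from math import sqrt as square_root

# All four names point to the same square-root function
results = [math.sqrt(16), m.sqrt(16), sqrt(16), square_root(16)]
print(results)
```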

In [None]:
import numpy as np
import pandas as pd

from IPython.display import HTML
from IPython.display import Image as img

# This makes it so that plots show up here in the notebook.
# You do not need it if you are not using a notebook.
%matplotlib inline

## \*** To motivate this session

In [None]:
img(url='https://i.pinimg.com/736x/ca/47/94/ca4794cfada458717c7aa99093a1f425--computer-humor-computer-science.jpg')

> In many (if not most) cases, errors occur because you specified **the wrong data structure or its parameters**!

## \*** Overview of this session's topics
#### 1. 'General' data structure
* Primitive data structures: int, float, str, bool
* Non-primitive data structures: list, tuple, set, dict

#### 2. Pandas' data structure: DataFrame (and its manipulation)

#### 3. Application using MLB data






## 1. 'General' data structures
### 1.1. Primitive data structures
- (1) Integer: int()
- (2) Float: float()
- (3) String: str() or ""
- (4) Boolean: True or False

### 1.2. Non-primitive data structures
- (1) List: list() or []
- (2) Tuples: tuple() or ()
- (3) Set: set() or {1, 2} (*Note*. Empty braces {} create a dictionary, not a set)
- (4) Dictionary: dict() or {}

> _Note_. There are many other structures specific to individual packages (e.g., numpy array, pandas dataframe)
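
> *Note*. One detail worth checking yourself: empty curly braces create a dictionary, while an empty set needs `set()`. A quick sketch (the variable names are for illustration only):

```python
empty_braces = {}      # empty curly braces create a dictionary
empty_set = set()      # an empty set needs set()
small_set = {1, 2}     # braces with items (and no key: value pairs) create a set

print(type(empty_braces))
print(type(empty_set))
print(type(small_set))
```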

> **Examples.**
- Primitive data structures

In [None]:
# Integer
a = 3
print(a)
type(a)

In [None]:
# Float
b = 3.
print(b)
type(b)

In [None]:
# String
c = "Go Blue!"
print(c)
type(c)

In [None]:
# Boolean
d = False
print(d)
type(d)

> - Non-primitive data structures

In [None]:
# List
e = [1, 2, 2]
print(e)
type(e)

In [None]:
# Tuple
f = (1, 2)
print(f)
type(f)

In [None]:
# Set
g = {2, 1}
print(g)
type(g)

In [None]:
# Dictionary
h = {'Donald': 'Male', 'Melania': 'Female'}
print(h)
type(h)

> **Q. (5 mins)**
- (E1) Change *a* into a float and *b* into an integer
- (E2) Divide *a* by *b* and check the type (Note that *a* is an integer and *b* is a float)
- (E3) Get *'Go Blue!Go Blue!Go Blue!'* using a and c (DON'T type this yourself!)
- (E4) Get *'Go Blue!3'* using a and c
- (E5) Check whether *a* is larger than *b*. Save the result as *t* and see whether t is the same as d.
- (E6) Change *e* into a set and check what has changed.
- (E7) Make a list, a tuple, and a set with the following inputs: *6, "Hello", [3, 4, 5]*

In [None]:
# Write your answers below
# E1


# E2


# E3


# E4


# E5


# E6


# E7


### 1.3. Changing non-primitive data structures
- List

In [None]:
# To call an item in a specific position in a list
e[1]

In [None]:
# To call an item in a range of positions in a list
e[1:3]

In [None]:
# To change an item in a specific position in a list
e[1] = 8
print(e)

In [None]:
# To add an item to the end of a list
e.append(3)
print(e)

In [None]:
# To add an item to a specific position
e.insert(0, 3)
print(e)

In [None]:
# To remove an item (*** ONLY THE FIRST OCCURRENCE is removed)
e.remove(3)
print(e)

In [None]:
# To sort the items in ascending order
e.sort()
print(e)

In [None]:
# To sort the items in descending order
e.sort(reverse=True)
print(e)

In [None]:
# List-in-list
x = [1, [2, 3], 4, 5]
x
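
> *Note*. Items inside a nested list are reached by chaining positions. A small sketch, using a separate variable `nested` (a name chosen for this example) so that *x* above is left untouched:

```python
nested = [1, [2, 3], 4, 5]

first_inner = nested[1][0]   # position 1 is the inner list; [0] is its first item
print(first_inner)

nested[1].append(9)          # the inner list can be changed in place
print(nested)
print(len(nested))           # the inner list still counts as ONE item
```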

- Tuple

In [None]:
# To call an item in a tuple
f[0]

> **Q. Try to change the value 2 to 4 in *f* **

In [None]:
print(f)
# Write your answer below


- Set

> **Q. Try to call the first item in set g **

In [None]:
print(g)
# Write your answer below


In [None]:
# To add an item to a set
g.add(3)
print(g)

In [None]:
# To remove an item in a set
g.remove(3)
print(g)

- Dictionary

> **Q. Try to call the first item in dictionary h**

In [None]:
print(h)
# Write your answer below


In [None]:
# To see the keys
h.keys()

In [None]:
# To see the values of keys
h.values()

In [None]:
# To call the value of a key
h['Donald']

In [None]:
# To add a key and its value
h['Barack'] = 'Male'
print(h)

In [None]:
# To remove a key and its value
h.pop('Barack', None)
print(h)

In [None]:
# Dict-in-dict
y = {'Ronnie': {'Age': 20, 'Monthly income ($)': 0, 'Gender': 'M', 'Friends': None},
    'Mana': {'Age': 28, 'Monthly income ($)': 1000000, 'Gender': 'F', 'Friends': ('Teddy', 'Jeff', 'Abby', 'Patrick'),  'Pets': ['1 dog', '1 cat']}}
y
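
> *Note*. Values in a nested dictionary are reached by chaining keys. A minimal sketch on a trimmed-down copy of the data above (the variable `people` is made up for this example):

```python
people = {'Ronnie': {'Age': 20, 'Friends': None},
          'Mana': {'Age': 28, 'Pets': ['1 dog', '1 cat']}}

mana_age = people['Mana']['Age']         # outer key first, then inner key
first_pet = people['Mana']['Pets'][0]    # a list stored inside the inner dict
# .get() returns a default instead of raising a KeyError when the key is missing
ronnie_pets = people['Ronnie'].get('Pets', [])
print(mana_age, first_pet, ronnie_pets)
```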

# *** BREAK TIME FOR 5 MINS! ***

In [None]:
img(url='https://i.pinimg.com/originals/63/da/8b/63da8b38a240b71f8dfbe6b0b9b18036.png')

## 2. Pandas' data structure: DataFrame (and its manipulation)
- **pandas**: an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

> **Q. What is a pandas dataframe?**
- In a nutshell, **an Excel sheet**!

> **Q. Why not use Excel then?**
- Excel is **SUPER**-slow and very memory-intensive (and a worksheet is capped at 1,048,576 rows), i.e., it is impractical for big data!
- Pandas (along with other Python packages) allows you to efficiently handle big data!

In [None]:
img(url='https://www.askideas.com/media/30/Cause-OF-Death-Microsoft-Excel-Funny-Image.jpg')

### 2.1. How to create, save, and read a dataframe

#### (1) Create a dataframe

In [None]:
# data = dictionary{key: value}
data = {'Course'  : 'Intro to Big Data', # Strings
        'Section' : 4, # Int
        'Names'   : ['Abigail', 'Jeff', 'Mana', 'Patrick'], # List
        'Group'   : ['1'] * 2 + ['2'] * 2,
        'Date'    : pd.Timestamp.today().date()} # Time stamps

df1 = pd.DataFrame(data)
df1
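
> *Note*. In the cell above, single values (e.g., 'Course') are repeated for every row, while the lists decide how many rows there are. A small sketch of that behavior; the dataframe `toy` and the variable `mismatch_error` are made up for this example:

```python
import pandas as pd

# Scalars are repeated for every row; lists set the number of rows
toy = pd.DataFrame({'Course': 'Intro to Big Data',            # single value
                    'Names': ['Abigail', 'Jeff', 'Mana']})    # list of length 3
print(toy)

# Lists of different lengths cannot be combined into one dataframe
try:
    pd.DataFrame({'Names': ['Abigail', 'Jeff'], 'Group': ['1', '2', '3']})
    mismatch_error = None
except ValueError as err:
    mismatch_error = err
print(mismatch_error)
```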

> **Q.** Create the following dataframe using pandas and save it as a variable 'df'.

Name      | Depart
:--------:| -----:
  Abigail |  CLASP 
     Jeff |  SOCIO 
     Mana |  STRAT 
  Patrick |     MO 
   Ronnie |  STRAT 
    Teddy |     MO 


In [None]:
# Write your answer below


#### (2) Save the dataframe into a file (Note. We will learn how to save first, since we don't have a file to read yet!)
* csv/tsv/txt file

> *Note*. Don't forget to specify the **separator**!

In [None]:
df.to_csv('data.csv', sep = ',',  index = False) # if comma separated (csv)
df.to_csv('data.tsv', sep = '\t', index = False) # if tab separated (tsv)
df.to_csv('data.txt', sep = '\t', index = False) # you can also use sep = ',' as in csv files

> *Note*. Better to use **TSV** than CSV!
- Why? There may be commas(,) in the values (e.g., "Lee, Ronnie")
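
> *Note*. When pandas itself writes a CSV, it wraps values that contain a comma in quotes by default, so a pandas-to-pandas round trip still works. The risk is mainly with files written by hand or by other tools that do not quote. A sketch with a made-up dataframe `people`:

```python
import io
import pandas as pd

people = pd.DataFrame({'Name': ['Lee, Ronnie', 'Mana'],
                       'Depart': ['STRAT', 'STRAT']})

csv_text = people.to_csv(index=False)   # no file name given, so the text is returned
print(csv_text)                         # the name containing a comma is quoted

# Read the text back as if it were a file
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.equals(people))
```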

* Excel file

In [None]:
df.to_excel('data.xlsx', index_label='label')

- Other formats you can save to
    - **LaTeX**: df.to_latex()
    - **JSON**: df.to_json()
    - **Stata**: df.to_stata('file_name.dta')
    - **SQL**: df.to_sql(name='table_name', con=engine), where *engine* is a database connection (e.g., a SQLAlchemy engine)
    - and many more.

#### (3) Read a file into a dataframe
* csv/tsv/txt file

> Note: Don't forget to specify the **separator**!

In [None]:
df_csv = pd.read_csv('data.csv', sep = ',')
df_csv

In [None]:
df_tsv = pd.read_csv('data.tsv', sep = '\t')
df_tsv

* Excel file

In [None]:
with pd.ExcelFile('data.xlsx') as xlsx:
    df_excel = pd.read_excel(xlsx, sheet_name = 'Sheet1')
df_excel

### If there are multiple sheets to read from
# with pd.ExcelFile('data.xlsx') as xlsx:
#    df_sheet1 = pd.read_excel(xlsx, sheet_name = 'Sheet1')
#    df_sheet2 = pd.read_excel(xlsx, sheet_name = 'Sheet2')

* Other formats you can read
    - **Stata**: pd.read_stata()     
    - **SAS**: pd.read_sas()         
    - **SQL**: pd.read_sql_table()   
    - **JSON**: pd.read_json()      
        - Especially useful when using **API**'s (tomorrow's session)
    - **HTML**: pd.read_html()       
        - Especially useful when scraping html (tomorrow's session)
    - and many more.

### 2.2. How to rename, add, and remove row/column(s) in the dataframe


#### (1) Rename a column

In [None]:
df1 = df1.rename(columns={"Names": "Name"})
df1

#### (2) Add row/column(s)
* Rows: *.append()* (*Note*. Removed in pandas 2.0; on newer versions use *pd.concat()* instead)

In [None]:
# First, create a new dataframe
new_data = {'Course'  : 'Intro to Big Data',
            'Section' : 4,
            'Name'    : ['Ronnie', 'Teddy'],
            'Group'   : '5',
            'Date'    : pd.Timestamp.today().date()}

df2 = pd.DataFrame(new_data)
df2

In [None]:
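# Append the new dataframe to the existing dataframe
# Note: DataFrame.append() was removed in pandas 2.0. If this line fails, use
# df1 = pd.concat([df1, df2], ignore_index=True) instead
df1 = df1.append(df2, ignore_index=True)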
df1

* An alternative way to add row(s) is *pd.concat()*. Running it after the cell above adds Ronnie and Teddy a second time; we remove these duplicates in (3).

In [None]:
df1 = pd.concat([df1, df2], axis = 0, ignore_index = True)    # If axis = 1, then add column
df1
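
> *Note*. A small sketch of what the *axis* parameter does, on made-up dataframes (`top`, `bottom`, and `scores` are names chosen for this example):

```python
import pandas as pd

top = pd.DataFrame({'Name': ['Abigail', 'Jeff'], 'Group': ['1', '1']})
bottom = pd.DataFrame({'Name': ['Mana'], 'Group': ['2']})
scores = pd.DataFrame({'Assignment': [95, 85]})

# axis = 0 stacks the dataframes on top of each other (more rows)
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)
# axis = 1 puts them side by side, matched on the index (more columns)
side_by_side = pd.concat([top, scores], axis=1)

print(stacked)
print(side_by_side)
```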

* Columns

In [None]:
df1['Assignment'] = [95, 85, 90, 75, 10, 80, 10, 80]
df1

#### (3) Remove rows, columns, and duplicates
* Rows (by index)

In [None]:
df1.drop(0)

* Columns
> *Note*. axis = 1 denotes that we are referring to a column, not a row

In [None]:
df1 = df1.drop('Date', axis = 1)
df1

* Duplicates

In [None]:
# First, in order to check whether there are any duplicates
df1.duplicated()

In [None]:
# If there are duplicates, then run the following code
df1 = df1.drop_duplicates()
df1
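
> *Note*. By default a row counts as a duplicate only if ALL of its columns repeat. The *subset* and *keep* parameters change that. A sketch on a made-up dataframe `roster`:

```python
import pandas as pd

roster = pd.DataFrame({'Name': ['Ronnie', 'Teddy', 'Ronnie', 'Teddy'],
                       'Assignment': [10, 80, 10, 85]})

# True only where the WHOLE row has appeared before
full_dupes = roster.duplicated()
# Compare on 'Name' only and keep the first row of each name
by_name = roster.drop_duplicates(subset='Name')
# Same comparison, but keep the last row of each name
by_name_last = roster.drop_duplicates(subset='Name', keep='last')

print(full_dupes)
print(by_name)
print(by_name_last)
```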

### 2.3. Merge two dataframes

> **Q.** Create a dataframe with the following information:
- Group 1: 
    - Presentation: 80
    - Report: 60
- Group 2:
    - Presentation: 90
    - Report: 80
- Group 3:
    - Presentation: 100
    - Report: 70
- Group 4:
    - Presentation: 50
    - Report: 30

In [None]:
# Write your answer below


In [None]:
# Recall df1
df1

> **Q.** How can we input the grades for each individual's presentation and report without putting in the scores one by one?
- Use **merge**

In [None]:
df1.merge(df3, on='Group')

> **OOPS! We lost Ronnie and Teddy! What went wrong?**


> - **Important parameter**: how = {'left', 'right', 'outer', 'inner'}
    - **inner** (*default*): use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys
    - **outer**: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
    - **left**: use only keys from left frame, similar to a SQL left outer join; preserve key order
    - **right**: use only keys from right frame, similar to a SQL right outer join; preserve key order
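
> *Note*. A sketch of the four options on two made-up dataframes (`left` and `right`, with a shared column `key`), so that you can see which keys survive each kind of merge before trying it on df1:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

# Record which keys remain after each kind of merge
kept_keys = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = left.merge(right, on='key', how=how)
    kept_keys[how] = merged['key'].tolist()
    print(how, kept_keys[how])

# indicator=True adds a '_merge' column showing where each row came from
flagged = left.merge(right, on='key', how='outer', indicator=True)
print(flagged)
```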
    
> **Q.** Which one of these should we set *how* as?
- Try the other three parameters and find the right one!

In [None]:
# (1) Left


In [None]:
# (2) Right


In [None]:
# (3) Outer


* Thus, the right parameter to merge the two dataframes is

In [None]:
# Write your answer below


> **Q.** Recall that we created the dataframe df, which has information on each student's department. Merge df with df1. Which of the parameters should you use for *how*?

In [None]:
df
# Write your answer below


### 2.4. Check what's in the dataframe
#### (1) See the top and bottom rows of the dataframe

In [None]:
df1.head() # pass a number to choose how many rows to show (e.g., df1.head(2))

In [None]:
df1.tail() # pass a number to choose how many rows to show (e.g., df1.tail(2))

#### (2) Display the index, columns, and the underlying data

In [None]:
df1.index

In [None]:
df1.columns

In [None]:
df1.values

#### (3) Sort by values

In [None]:
df1.sort_values(by='Assignment')   # Ascending order

In [None]:
df1.sort_values(by='Assignment', ascending=False)    # Descending order

> **Q.** What would happen if we sort a column which has a missing value (i.e., NaN)?

In [None]:
df1.sort_values(by='Report', ascending=False)    # Descending order
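
> *Note*. Where the missing values end up can be controlled with the *na_position* parameter. A sketch on a made-up dataframe `scores`:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'Name': ['A', 'B', 'C'],
                       'Report': [60.0, np.nan, 80.0]})

# By default NaN goes to the end, whether ascending or descending
nan_last = scores.sort_values(by='Report', ascending=False)
# na_position='first' moves NaN to the top instead
nan_first = scores.sort_values(by='Report', ascending=False, na_position='first')

print(nan_last)
print(nan_first)
```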

#### (4) Search for a value

In [None]:
df1.where(df1['Assignment'] > 90)

* For specific column

In [None]:
df1['Name'].where(df1['Assignment'] > 90)

> **Q.** How can we count the number of students who scored above 90 on 'Assignment'?

In [None]:
df1['Name'].where(df1['Assignment'] > 90).count()
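
> *Note*. Another way to count, sketched on a made-up dataframe `scores`: a condition gives a column of True/False values, and summing it counts the True's.

```python
import pandas as pd

scores = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                       'Assignment': [95, 85, 91, 90]})

mask = scores['Assignment'] > 90               # True where the condition holds
n_where = scores['Name'].where(mask).count()   # count() skips the NaN entries
n_sum = mask.sum()                             # True counts as 1, False as 0
print(n_where, n_sum)
```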

#### (5) Select
* Column(s)

In [None]:
df1['Assignment']

> *Note*. An alternative way to select a column is dot (.) notation. It only works when the column name is a valid Python name (e.g., no spaces).

In [None]:
df1.Assignment

* Row(s)

In [None]:
df1[0:3]

* By label (*.loc*)

In [None]:
df1.loc[0,'Name']  # .loc[index_label, column_name]

In [None]:
df1.loc[0,['Name','Presentation']]  # .loc[index_label, [column_name1, column_name2, column_name3,...]]
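
> *Note*. *.loc* looks rows up by their index LABEL, while *.iloc* looks them up by POSITION. The two agree only as long as the index is still 0, 1, 2, ... in order. A sketch on a made-up dataframe `scores`:

```python
import pandas as pd

scores = pd.DataFrame({'Name': ['Abigail', 'Jeff', 'Mana'],
                       'Assignment': [95, 85, 90]})
shuffled = scores.sort_values(by='Assignment')   # rows reorder; labels travel with them

by_label = shuffled.loc[0, 'Name']     # the row whose index label is 0
by_position = shuffled.iloc[0, 0]      # the row in the first position, first column
print(by_label, by_position)
```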

* Using a condition

In [None]:
df1[df1['Assignment'] > 90]

In [None]:
df1[df1['Depart'].isin(['MO', 'SOCIO'])]

> **Q.** Select the rows in which the student(s) is in MO department and has more than 80 points for his/her presentation.

In [None]:
# Write the answer below


### 2.5. Missing data
* Let's first take a look at what we have as our dataframe

In [None]:
df1

#### (1) Check whether there are any missing data

In [None]:
df1.isnull()

> *Note*. If the dataframe is large, it would NOT be easy to see whether there are any True's
- An easier way to check is to sum the number of True's

In [None]:
df1.isnull().sum()

#### (2) Managing missing data
- [Option 1] Drop row(s) with missing data

In [None]:
df1.dropna(how='any')

- [Option 2] Drop column(s) with missing data
> *Note*. axis = 1 denotes that we are referring to a column, not a row

In [None]:
df1.dropna(how='any', axis = 1)

- [Option 3] Fill in ALL missing data with a single value

In [None]:
df1.fillna(value = 0)
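
> *Note*. A single fill value is not always sensible for every column. A sketch, on a made-up dataframe `grades`, of passing a dictionary so that each column gets its own fill value:

```python
import numpy as np
import pandas as pd

grades = pd.DataFrame({'Depart': ['MO', None, 'STRAT'],
                       'Report': [60.0, np.nan, 80.0]})

# A dict maps each column name to its own fill value
filled = grades.fillna(value={'Depart': 'unknown', 'Report': 0})
print(filled)

# fillna returns a new dataframe; the original still has its missing values
n_missing_before = grades.isnull().sum().sum()
n_missing_after = filled.isnull().sum().sum()
print(n_missing_before, n_missing_after)
```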

> **Q.** Fill in missing data in column 'Presentation' with the mean of non-missing data in column 'Presentation'

In [None]:
# Write your answer below


* [Option 4] Fill in values by location

In [None]:
df1.loc[[4, 5], 'Report'] = 60
df1

### 2.6. Basic statistics

In [None]:
# Recall the dataframe
df1

#### (1) describe() shows a quick statistical summary of your data

In [None]:
df1.describe()

> **Q.** TOO MANY ZEROS! How can we make this prettier?

In [None]:
df1.describe().round(2)

#### (2) Calculate
* Mean

In [None]:
df1.mean(numeric_only=True).round(2)   # numeric_only=True skips text columns (required in pandas 2.0+)

* Median

In [None]:
df1.median(numeric_only=True).round(2)

* Min/Max

In [None]:
df1['Report'].min().round(2)

In [None]:
df1['Report'].max().round(2)

* Variance

In [None]:
df1.var(numeric_only=True).round(2)

* Correlation

In [None]:
df1.corr(numeric_only=True).round(2)

#### (3) Grouping: a process involving one or more of the following steps
* Splitting the data into groups based on some criteria
* Applying a function to each group independently
* Combining the results into a data structure
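
> *Note*. The three steps above can be sketched by hand and compared with the one-line *groupby* version. The dataframe `scores` and the other variable names are made up for this example:

```python
import pandas as pd

scores = pd.DataFrame({'Group': ['1', '1', '2', '2'],
                       'Report': [60, 80, 70, 90]})

# Split: iterate over one sub-dataframe per group
# Apply: take the mean of 'Report' within each group
means = {}
for group_name, group_df in scores.groupby('Group'):
    means[group_name] = group_df['Report'].mean()

# Combine: collect the per-group results into one Series
by_hand = pd.Series(means, name='Report')

# groupby performs all three steps in a single call
built_in = scores.groupby('Group')['Report'].mean()

print(by_hand)
print(built_in)
```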

In [None]:
df1.groupby('Group').mean(numeric_only=True)

> **Q.** How can we group by 'Group' and 'Depart'?

In [None]:
df1.groupby(['Group', 'Depart']).mean(numeric_only=True)

#### (4) Pivot tables

In [None]:
df1.pivot_table(values='Report', index=['Depart'], columns=['Group'])

### 2.7. Basic column operations
* Logarithm
    - Natural logarithm: np.log()
    - The base 10 logarithm: np.log10()
    - The base 2 logarithm: np.log2()

In [None]:
df1['log_Report'] = np.log(df1['Report'])
df1

* Square root

In [None]:
df1['sqrt_Report'] = np.sqrt(df1['Report'])
df1
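
> *Note*. The logarithm of 0 is minus infinity, which matters when a column contains zeros. A common workaround is to take the log of (1 + x), which numpy provides as `np.log1p()`. A sketch on a made-up Series `report`:

```python
import numpy as np
import pandas as pd

report = pd.Series([0, 60, 80])

# Silence the divide-by-zero warning for this demonstration only
with np.errstate(divide='ignore'):
    plain_log = np.log(report)      # log(0) gives -inf

shifted_log = np.log1p(report)      # log(1 + x), so 0 maps to 0

print(plain_log.tolist())
print(shifted_log.tolist())
```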

> **Q.** Suppose that the evaluation is based on the following weights
- Assignment: 25%
- Presentation: 35%
- Report: 40%

> Make a new column 'Total' that calculates the weighted sum.
> Then show the table in descending order by 'Total'.

In [None]:
# Write your answers below


> Another way of doing so is to use **apply**.

In [None]:
# Define a function
def total(row):
    return 0.25 * row['Assignment'] + 0.35 * row['Presentation'] + 0.4 * row['Report']

# Apply the function above to each row
df1['total'] = df1.apply(total, axis=1)
df1

### \*** How to define a function


In [None]:
def function_name(input_1, input_2):    # list as many inputs as you need

    x = input_1 + input_2               # do something with the inputs

    return x

> - Example

In [None]:
def divide(x, y):
    return x / y

a = 8
b = 4

print(divide(a, b))
print(divide(b, a))
print(divide(x=a, y=b))
print(divide(y=b, x=a))

In [None]:
def power(x, y=2): # Define the default value
    return x ** y

print(power(3))
print(power(3, 3))
print(power(3, 1))

> **Q.** Write a function that takes the following three variables

>    - x: a list
>    - y: an integer for the index
>    - z: an integer with a default value of 2

> and checks whether the yth value in list x is divisible by z

In [None]:
# Write your answer below



* Histogram

In [None]:
df1['Total'].hist()

## 3. Application using MLB data

In [None]:
img(url='https://i2.wp.com/wemakescholars.com/blog/wp-content/uploads/2015/03/funny-inspirational-quotes-about-school-inspirational-3-on-funny-quotes.jpg')

* To check whether you **REALLY** understood or not, try the following exercise using the data on the players in the Major League Baseball (MLB).

> **Q.** Read the datafile 'Salaries.csv' (separator = comma) as the variable 'salaries' and show its FIRST 10 rows

> **Q.** Read the datafile 'Batting.xlsx' (sheetname = 'Batting') as the variable 'batting' and show its LAST 5 rows

> **Q.** Create a variable 'data' by LEFT MERGING 'salaries' and 'batting' based on 'player',  'year', and  'team' and show the FIRST 7 rows

> **Q.** Read the STATA datafile 'Pitching.dta' as the variable 'pitching' and show its LAST 4 rows

> **Q.** FOR SIMPLICITY, drop all columns with missing values in pitching

> **Q.** LEFT MERGE 'data' and 'pitching' based on 'player', 'year', and 'team' and show the top 5 rows

> **Q.** Check whether there are any missing values in 'data' 

> **Q.** FOR SIMPLICITY, fill in the missing values with zeros and re-check whether there are any missing values in data

> **Q.** Read the csv files 'Basic.csv' (separator = comma) and INNER MERGE with 'data' based on 'player'

> **Q.** Save the dataframe 'data' as a tsv file 'baseball.tsv' (separator = tab)

> **Q.** Draw the histogram of salary

> **Q.** Since the salary is heavily skewed, create a new column 'log_salary' by putting a natural log on ('salary' + 1) and then draw the histogram of 'log_salary'.

> **Q.** Show summary statistics

> **Q.** Find the TOP 5 birth states (i.e., 'birthState') with the most MLB players.
- *Hint*. Compute size of each group by birthState, save the dataframe as a variable 'state', and sort the values by the sizes in descending order.

> **Q.** Find how many MLB players are from Michigan (i.e., 'birthState' is 'MI').
- *Hint*. Don't print the variable 'state'! Recall how you select a row based on some conditions.

> **Q.** Find the TOP 5 months (i.e., 'birthMonth') in which MLB players were born.
- *Hint*. Compute size of each group by birthMonth, save the dataframe as a variable 'month', and sort the values by the sizes in descending order.

> **Q.** Which year had the highest average salary?
- *Hint*. Compute the mean by year, save the dataframe as a variable 'year', and show the TOP 10 years in terms of mean salary

> **Q.** Which team had the highest average salary?
- *Hint*. Compute the mean by teams, save the dataframe as a variable 'team', and show the TOP 5 teams in terms of mean salary

> **Q.** The above results computed each team's average across all years. Instead, find which team in which year had the highest average salary?
- *Hint*. Compute the mean by 'team' AND 'year', save the dataframe as a variable 'team', and show the TOP 5 teams in terms of mean salary

> **Q.** How did each team's average height change over time?
- *Hint*. Make a pivot table, save the dataframe as a variable 'height', and show the TOP 5 teams in terms of mean height in 2014.

> **Q.** [TRY GOOGLING] Create dummy variables for 'year'

> **Q.** Concatenate the two dataframes by column

### A quote to close this session.

In [None]:
img(url='https://pics.onsizzle.com/never-let-your-computer-know-that-you-are-in-a-5887996.png')

### If you have any questions, please feel free to send me an email
* Instructor: Ronnie (Saerom) Lee
* E-mail: saerom@umich.edu (for any questions)