# Python Part 3:  Pandas Example

### This notebook runs through an example using Pandas with the `jeopardy.csv` data.

**Data Source:**  200K+ Jeopardy questions from [Reddit](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/)

---


## Part 1:  import libraries, check versions, set up preferences

In [1]:
# Python 2 & 3 Compatibility
from __future__ import print_function, division

In [None]:
# imports a library 'pandas', names it as 'pd'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from IPython.display import Image

# enables inline plots; without it, plots may not show up in the notebook
%matplotlib inline

In [None]:
# check version of libraries
print("Pandas version:", pd.__version__)
print("Numpy version:", np.__version__)

In [None]:
# confirming which version of Python I am using
import sys
print("Python Version:", sys.version)

In [None]:
# set various options in pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 25)
pd.set_option('display.precision', 3)

In [None]:
# check size of file; notice this is a bash command -- I can run it in the notebook!
!ls -l

## Part 2:  read in the data

In [None]:
# read csv data into pandas dataframe
df_orig = pd.read_csv('jeopardy.csv', encoding="ISO-8859-1")

# Note: I normally don't need to specify encoding.
# But, when I read in this csv file with the default encoding (UTF-8), there was a decoding error.
# I googled it; text files can be saved in different character encodings.  I tried a few and this one worked.

# Data formatting is unpredictable, and one of the skills in data science is to 'google' 
# and see how to work through data issues
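
The trial-and-error described in the comments above can be sketched as a loop. This is only a minimal sketch, not the approach used in this notebook: `read_csv_with_fallback` is a hypothetical helper, and the list of candidate encodings is an assumption.

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "cp1252", "ISO-8859-1")):
    """Hypothetical helper: try each encoding in turn.

    Returns a tuple (dataframe, encoding that worked).
    """
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            # remember the failure and move on to the next candidate
            last_error = err
    raise last_error
```

`ISO-8859-1` maps every possible byte to a character, so it never raises a decoding error; that is why it sits last in the list. A read that succeeds is not proof the characters are right, so it is still worth eyeballing a few rows with accented text.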

## Part 3:  look at data
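`df` = the dataframe being used.  In our case, it is `df_orig`.  Note that `shape`, `columns`, `values`, and `dtypes` are attributes (no parentheses); the others are methods.
```python
df.shape
df.info()
df.head()
df.tail()
df.columns
df.values
df.dtypes
```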

In [None]:
# check dimensions of dataframe
# (rows, columns)
df_orig.shape

In [None]:
df_orig.head(3)

## Part 4:  summarize data
```python
df.describe()
```

In [None]:
df_orig.describe()

In [None]:
df_orig['Round'].describe()

In [None]:
# print the unique values of the column 'Round'
df_orig['Round'].unique()

In [None]:
df_orig.groupby('Round').count()
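
`groupby(...).count()` reports the number of non-null entries in every column, which is not always the same as the number of rows in each group. A minimal sketch on made-up data (the `toy` frame and its values are illustrative assumptions, not rows from `jeopardy.csv`) shows the difference next to `size()` and `value_counts()`:

```python
import pandas as pd

# made-up rows for illustration only
toy = pd.DataFrame({
    "Round": ["Jeopardy!", "Jeopardy!", "Double Jeopardy!", "Final Jeopardy!"],
    "Value": ["$200", "$400", "$800", None],
})

# non-null entries per column within each group (the missing Value is not counted)
per_column_counts = toy.groupby("Round").count()

# number of rows in each group, regardless of missing values
rows_per_group = toy.groupby("Round").size()

# frequency of each category, sorted from most to least common
frequency = toy["Round"].value_counts()
```

If the goal is simply "how many questions per round", `size()` or `value_counts()` is usually the more direct choice.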

## Part 5:  create a new column

In [None]:
df_orig['Dollar_Amt'] = df_orig['Value']

## Part 6:  clean data

In [None]:
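# remove the '$' in the column Dollar_Amt
# (assign the result back; replace(..., inplace=True) on a selected column may not update the dataframe)
df_orig['Dollar_Amt'] = df_orig['Dollar_Amt'].str.replace('$', '', regex=False)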

df_orig.head(3)

In [None]:
# let's look at the tail end of the data
df_orig.tail(5)

In [None]:
# we need to do more cleaning.  There is a comma in the dollar amount
# remove the commas in the column 'Dollar_Amt'
# there are many ways to do it.  here's one:
df_orig['Dollar_Amt'] = df_orig['Dollar_Amt'].str.replace(',', '')
df_orig.tail(2)
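
The two cleaning steps above could also be done in a single pass. A minimal sketch, assuming made-up values in `toy` rather than the real column: the character class `[$,]` matches either a dollar sign or a comma.

```python
import pandas as pd

# made-up values for illustration only
toy = pd.Series(["$200", "$1,000", "None"], name="Value")

# strip every '$' and ',' in one regex replacement
cleaned = toy.str.replace(r"[$,]", "", regex=True)
```

Passing `regex=True` explicitly keeps the behavior the same across pandas versions, since the default for `str.replace` has changed over time.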

In [None]:
# create a new column which we want to be numeric
df_orig['Dollar_Amt_n'] = df_orig['Dollar_Amt']

In [None]:
# check data types
df_orig.dtypes

In [None]:
df_orig['Dollar_Amt_n'].describe()

In [None]:
df_orig['Dollar_Amt_n'].unique()

## Part 7:  change data type

In [None]:
df_orig['Dollar_Amt_n'] = pd.to_numeric(df_orig['Dollar_Amt_n'], errors='coerce')
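
With `errors='coerce'`, anything that cannot be parsed as a number silently becomes `NaN`, so it helps to check what was lost. A minimal sketch on made-up values (`toy` is an illustrative assumption, not the real column):

```python
import pandas as pd

# made-up values for illustration only
toy = pd.Series(["200", "1000", "None"])

# unparseable strings become NaN instead of raising an error
as_numbers = pd.to_numeric(toy, errors="coerce")

# how many values were turned into NaN by the conversion
n_coerced = as_numbers.isna().sum() - toy.isna().sum()

# which original strings could not be converted
bad_values = toy[as_numbers.isna() & toy.notna()].unique()
```

Because `NaN` is a floating-point value, the resulting column is `float64` even when every valid entry is a whole number.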

In [None]:
df_orig['Dollar_Amt_n'].unique()

In [None]:
# check data types
df_orig.dtypes

# notice Dollar_Amt_n is now type float64

In [None]:
# notice now we see summary statistics (rather than frequency counts for string data)
df_orig['Dollar_Amt_n'].describe()

## Part 8:  visualize data

In [None]:
# do barplot of a categorical variable
df_orig['Round'].value_counts().plot(kind='barh')

In [None]:
# do barplot of a numerical variable
fig = plt.figure(figsize=(12,5))

df_orig['Dollar_Amt_n'].value_counts().plot(kind='bar')
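
`value_counts()` orders its result by frequency, so the bars above run from the most to the least common amount rather than in dollar order. A small sketch on made-up numbers (`toy` is an illustrative assumption) shows how `sort_index()` reorders the counts by value:

```python
import pandas as pd

# made-up dollar amounts for illustration only
toy = pd.Series([400.0, 200.0, 400.0, 1000.0, 400.0, 200.0])

# default order: most frequent amount first
by_frequency = toy.value_counts()

# same counts, reordered by the dollar amount itself
by_amount = toy.value_counts().sort_index()
```

Calling `.plot(kind='bar')` on `by_amount` would then draw the bars in ascending dollar order.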

In [None]:
# do barplot of a categorical variable
fig = plt.figure(figsize=(12,5))

df_orig['Value'].value_counts().plot(kind='bar')