<a href="https://colab.research.google.com/github/alblaine/Data-Cleaning-with-Python/blob/master/Data_Cleaning_with_Python_Activity_2019.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Cleaning with Python**  
An NCSU Libraries Workshop  
Instructor: Alison Blaine, ablaine@ncsu.edu



### Welcome! In this workshop, we'll learn how to do the following: 
* load Python libraries for data cleaning (pandas) and graphing (matplotlib)
* read CSV files into Python 
* examine the first and last few rows of the data
* delete duplicates
* filter the data to create subsets
* sort the data
* group the data for plotting
* drop variables from the dataset
* create new variables
* generate and save summary statistics for a dataset

### Step 1. We'll start by loading in the required Python libraries.

In [0]:
import pandas as pd
import matplotlib.pyplot as plt

### Step 2. Next, we'll load in population.csv. First, download the data file to your laptop from [go.ncsu.edu/popdata](https://go.ncsu.edu/popdata). This dataset is a CSV (comma-separated values) file.

In [0]:
from google.colab import files
files.upload()

Saving population.csv to population.csv


{'population.csv': b'state/region,ages,year,population\nAL,under18,2012,1117489\nAL,total,2012,4817528\nAL,under18,2010,1130966\nAL,total,2010,4785570\n...

In [0]:
dat=pd.read_csv('population.csv')

### Step 3. Type dat.head() to see the first 5 rows of the dataset. Then click the run button.
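
As a rough sketch of Step 3, the example below inlines the first eight data rows of population.csv and calls `head()` and `tail()` on them. The names `csv_text` and `sample` are stand-ins used only for illustration; in the workshop you would call these methods on `dat`.

```python
import io
import pandas as pd

# The first eight data rows of population.csv, inlined so the example runs on its own
csv_text = """state/region,ages,year,population
AL,under18,2012,1117489
AL,total,2012,4817528
AL,under18,2010,1130966
AL,total,2010,4785570
AL,under18,2011,1125763
AL,total,2011,4801627
AL,total,2009,4757938
AL,under18,2009,1134192
"""
sample = pd.read_csv(io.StringIO(csv_text))

first_rows = sample.head()   # first 5 rows by default
last_rows = sample.tail(3)   # last 3 rows

print(first_rows)
print(last_rows)
```

Passing a number, as in `tail(3)`, changes how many rows come back; with no argument both methods return five.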

### Step 4. Rename the state/region column to state_region using rename().
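
One possible way to do the rename is sketched below on a two-row stand-in for `dat` (the name `sample` is an assumption made for this example). Note that `rename()` returns a new DataFrame rather than changing the original, so the result has to be assigned to a variable.

```python
import pandas as pd

# A tiny stand-in for dat with the same column names as population.csv
sample = pd.DataFrame({
    'state/region': ['AL', 'AL'],
    'ages': ['under18', 'total'],
    'year': [2012, 2012],
    'population': [1117489, 4817528],
})

# Map the old column name to the new one, then keep the returned DataFrame
sample = sample.rename(columns={'state/region': 'state_region'})
print(sample.columns.tolist())
```

A name without a slash is also easier to use later in `query()`, where `state/region` would be read as a division of two columns.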

### Step 5. Print out the unique values in the state_region column using drop_duplicates().

### Step 6. Another way to see unique values in a column is to use the unique() function. It returns an array of the unique values.

### Step 7. To count the number of unique values in a column, wrap the unique() call in the len() function.

### Step 8. Use value_counts() to count the number of observations of each unique value for state_region.
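
Steps 5 through 8 can be sketched together on a small made-up column. The DataFrame `sample` and its values are placeholders chosen for illustration, not rows from population.csv; in the workshop the same calls would be made on `dat['state_region']`.

```python
import pandas as pd

# Made-up stand-in for dat: only the state_region column matters here
sample = pd.DataFrame({'state_region': ['AL', 'AL', 'NC', 'NC', 'NC', 'USA']})

# Step 5: drop_duplicates() keeps the first occurrence of each value (a Series)
unique_series = sample['state_region'].drop_duplicates()

# Step 6: unique() returns the distinct values as a NumPy array
unique_array = sample['state_region'].unique()

# Step 7: len() around unique() counts the distinct values
n_unique = len(sample['state_region'].unique())

# Step 8: value_counts() counts the rows for each distinct value
counts = sample['state_region'].value_counts()

print(unique_series)
print(unique_array)
print(n_unique)
print(counts)
```

The main difference between Steps 5 and 6 is the return type: `drop_duplicates()` keeps the original row labels in a Series, while `unique()` gives back a plain array.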

### Step 9. Filter out all rows except for NC using query().

In [0]:
# assumes the renamed column from Step 4 was saved back to dat
dat_filtered = dat.query("state_region == 'NC'")

### Step 10. Order the rows by year using sort_values().

In [0]:

dat_sorted = dat_filtered.sort_values('year')

### Step 11. Let's see what the mean population is for each category of ages (under18 and total).

In [0]:
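# group by age category and average the numeric columns (year and population)
age_means = dat_sorted.groupby('ages').mean(numeric_only=True)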

### Step 12.  Since we don't need the year column, we can delete it.

In [0]:
age_means = age_means.drop(columns='year')

### Step 13. Get the population value for under18 from age_means.
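
A minimal sketch of this lookup, assuming `age_means` was built with `groupby('ages')` so that the age categories are the row labels. The stand-in `sample` uses four Alabama rows from the start of population.csv, not the NC subset from the earlier steps.

```python
import pandas as pd

# Stand-in rows copied from the start of population.csv (Alabama, 2011 and 2012)
sample = pd.DataFrame({
    'ages': ['under18', 'total', 'under18', 'total'],
    'year': [2012, 2012, 2011, 2011],
    'population': [1117489, 4817528, 1125763, 4801627],
})

# Mean population per age category; the group labels become the row index
age_means = sample.groupby('ages').mean(numeric_only=True).drop(columns='year')

# Look up one cell by row label and column name
under18_mean = age_means.loc['under18', 'population']
print(under18_mean)
```

`.loc` takes the row label first and the column name second, which is why the grouped column `ages` can be used to pick the row.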

### Step 14. Read in the drivers.csv dataset from a URL.

In [0]:
url="https://raw.githubusercontent.com/alblaine/data-1/master/bad-drivers/bad-drivers.csv"

drivers = pd.read_csv(url)

### Step 15. Get info about the drivers dataset using .info().

### Step 16. Look at the data using the head() command.
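
Steps 15 and 16 might look like the sketch below. The DataFrame `sample` and its values are made-up placeholders so the example runs without a download; in the workshop the calls would be `drivers.info()` and `drivers.head()`.

```python
import pandas as pd

# Made-up stand-in for the drivers data; names and values are placeholders
sample = pd.DataFrame({
    'State': ['A', 'B', 'C'],
    'premiums': [800.0, 1000.0, None],
})

# info() prints the column names, non-null counts, and dtypes
sample.info()

# head(2) returns the first two rows
first_two = sample.head(2)
print(first_two)
```

The non-null counts reported by `info()` are a quick way to spot columns with missing values before cleaning.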

### Step 17. Rename the columns to shorter names using rename()

In [0]:
drivers_renamed = drivers.rename(columns={'Number of drivers involved in fatal collisions per billion miles':'num_per_billion',
       'Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding':'pct_speeding',
       'Percentage Of Drivers Involved In Fatal Collisions Who Were Alcohol-Impaired':'pct_alcohol_impaired',
       'Percentage Of Drivers Involved In Fatal Collisions Who Were Not Distracted':'pct_not_distracted',
       'Percentage Of Drivers Involved In Fatal Collisions Who Had Not Been Involved In Any Previous Accidents':'pct_no_accidents',
       'Car Insurance Premiums ($)':'premiums',
       'Losses incurred by insurance companies for collisions per insured driver ($)':'losses'})
                       
                                 

In [0]:
drivers_renamed.head()


### Step 18. Create a new dataset keeping only the state, premiums, and losses columns using the .filter() function.

In [0]:
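# assumes the state column is named 'State' (it is not changed by the rename in Step 17)
drivers_pl = drivers_renamed.filter(items=['State', 'premiums', 'losses'])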

### Step 19. Create a new variable, ratio, that is the ratio of losses to premiums (ratio = losses / premiums).

In [0]:
drivers_pl = drivers_pl.assign(ratio=drivers_pl['losses'] / drivers_pl['premiums'])

### Step 20. Create a scatter plot showing premiums = X, losses = Y.
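
One way to draw this plot is with the pandas plotting interface, which uses matplotlib underneath. The sketch below runs on a made-up stand-in called `sample`; the numbers are placeholders, not values from the drivers dataset.

```python
import pandas as pd

# Made-up stand-in for drivers_pl; the numbers are placeholders, not real data
sample = pd.DataFrame({
    'premiums': [800.0, 1000.0, 1200.0],
    'losses': [100.0, 150.0, 200.0],
})

# pandas draws the plot with matplotlib and returns the Axes object
ax = sample.plot.scatter(x='premiums', y='losses')

# Label the axes explicitly and add a title
ax.set_xlabel('premiums')
ax.set_ylabel('losses')
ax.set_title('Losses vs. premiums')
```

In a notebook the figure renders when the cell finishes running; in a plain script you would also call `plt.show()` from `matplotlib.pyplot`.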

### Step 21. Get summary statistics about the drivers_pl dataset.

### Step 22. Save the summary statistics into their own dataset.
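
Steps 21 and 22 can be sketched with `describe()`, which returns its results as a DataFrame that can be stored like any other. The stand-in `sample` and its values are made up for illustration; the name `dpl_summary` follows the later steps of this activity.

```python
import pandas as pd

# Made-up stand-in for drivers_pl; the numbers are placeholders, not real data
sample = pd.DataFrame({
    'premiums': [800.0, 1000.0, 1200.0],
    'losses': [100.0, 150.0, 200.0],
})

# describe() computes count, mean, std, min, quartiles, and max per numeric column
dpl_summary = sample.describe()
print(dpl_summary)
```

Because the result is a DataFrame, the statistic names (`count`, `mean`, and so on) become the row labels and the original columns stay as columns.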

### Step 23. Access the mean of the premiums column.

In [0]:
# this way uses the row index label
dpl_summary.loc['mean', 'premiums']

In [0]:
# this way uses the row position (mean is the second row of describe() output, position 1)
dpl_summary.iloc[1]['premiums']

### Step 24. Transpose columns and rows in dpl_summary using transpose().
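
A short sketch of the transpose, again on made-up placeholder values (the names `sample` and `dpl_transposed` are assumptions for this example). After transposing, each original column becomes a row and each statistic becomes a column.

```python
import pandas as pd

# Made-up stand-in for drivers_pl; the numbers are placeholders, not real data
sample = pd.DataFrame({
    'premiums': [800.0, 1000.0, 1200.0],
    'losses': [100.0, 150.0, 200.0],
})
dpl_summary = sample.describe()

# Swap rows and columns: variables become rows, statistics become columns
dpl_transposed = dpl_summary.transpose()
print(dpl_transposed)
```

The same lookup as in Step 23 now reads the other way around: `dpl_transposed.loc['premiums', 'mean']`.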

### Step 25. Practice. Read in a new dataset from the URL provided using the read_csv() command. Name the dataset 'exdat'.

In [0]:
url = "https://raw.githubusercontent.com/alblaine/exchange-rates/master/data/annual.csv"

exdat =  pd.read_csv(url)

exdat.head()

### Step 26. Create a year column based on the Date column (ex: 1971)

Note that **.assign** creates a new column, **pd.DatetimeIndex()** converts the Date column into a DatetimeIndex (an index of datetime values), and **.year** extracts the year from each value.

In [0]:
# See how to convert a date to a year using DatetimeIndex()
pd.DatetimeIndex(exdat['Date']).year



Int64Index([1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980,
            ...
            2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017],
           dtype='int64', name='Date', length=825)

In [0]:
# Now do it on this dataset and save as exdat

exdat = exdat.assign(year=pd.DatetimeIndex(exdat['Date']).year)  # see the Pandas docs here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DatetimeIndex.html
exdat

### Step 27. Delete the Date column using drop().
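
Dropping a column might look like the sketch below. The stand-in `sample` holds two placeholder rows with only the `Date` and `year` columns; in the workshop the call would be made on `exdat`.

```python
import pandas as pd

# Made-up stand-in for exdat with just the two columns relevant to this step
sample = pd.DataFrame({
    'Date': ['1971-01-01', '1972-01-01'],
    'year': [1971, 1972],
})

# drop() returns a new DataFrame without the named column
sample = sample.drop(columns='Date')
print(sample.columns.tolist())
```

As with `rename()`, the original DataFrame is left unchanged unless the result is assigned back.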

### Step 28. Practice. Filter the data to only include China and Mexico using the query() function.

In [0]:
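# assumes the country names are stored in a column called Country; adjust if exdat.head() shows otherwise
exdat_filtered = exdat.query("Country in ['China', 'Mexico']")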

### Step 29. Practice. Rename the "Exchange rate" column to "rate" using rename(). 

In [0]:
exdat_filtered = exdat_filtered.rename(columns={'Exchange rate': 'rate'})