# Pandas Advanced

In [None]:
import os

import pandas as pd
import seaborn as sns

In [None]:
crash_df = sns.load_dataset('car_crashes')

### Variable Explanations

| Variable | Explanation |
| --- | --- |
| total | Number of drivers involved in fatal collisions per billion miles |
| speeding | Number of drivers involved in fatal collisions per billion miles who were speeding |
| alcohol | Number of drivers involved in fatal collisions per billion miles who were alcohol-impaired |
| not_distracted | Number of drivers involved in fatal collisions per billion miles who were not distracted |
| no_previous | Number of drivers involved in fatal collisions per billion miles who had not been involved in any previous accidents |
| ins_premium | Car insurance premiums (\$) |
| ins_losses | Losses incurred by insurance companies for collisions per insured driver (\$) |
| abbrev | US state abbreviation |

Source: https://www.kaggle.com/fivethirtyeight/fivethirtyeight-bad-drivers-dataset

Note that the categories are not mutually exclusive. A driver counted in the fatal-collision statistics might have been alcohol-impaired and speeding, yet have had no previous accidents.

## Pandas Methods

### 1.1 Let's have a look at the column names of the dataframe. Print out the `columns` attribute.

### 1.2 The meaning of 'abbrev' is not obvious at first glance. Change it to 'US state' using the `rename()` method. Make sure to pass the right argument type; if you need help, call the `help` function on `pd.DataFrame.rename`.
### Many pandas dataframe methods also accept an `inplace` parameter, which defaults to `False`. To modify the given dataframe directly instead of reassigning the result, set it to `True`.
### Check the columns again after renaming to make sure it worked.
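A minimal sketch of the two ways to apply `rename()`, shown on a small made-up dataframe rather than the crash data (`toy_df`, `in_place_df`, and their values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy_df = pd.DataFrame({"abbrev": ["AL", "AK"], "total": [18.8, 18.1]})

# `columns` expects a mapping from old name to new name.
# By default rename() returns a new dataframe and leaves the original unchanged.
renamed = toy_df.rename(columns={"abbrev": "US state"})

# With inplace=True the dataframe itself is modified and None is returned.
in_place_df = toy_df.copy()
returned = in_place_df.rename(columns={"abbrev": "US state"}, inplace=True)

print(list(renamed.columns))
print(list(in_place_df.columns), returned)
```

Because the in-place call returns `None`, assigning its result back to the dataframe variable would overwrite the dataframe with `None`.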

### 1.3 Use the `.shape` attribute to see what dimensions the dataframe has.

### 1.4 A useful method to get a first impression of the type of data you are working with is `info()`. Apply it to the dataframe.

### 1.5 If you want some summary statistics of your numeric data, use the `describe()` method.
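The inspection tools from 1.3 to 1.5 can be sketched together on a small made-up dataframe (the data below is hypothetical, not the crash data):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy_df = pd.DataFrame({"abbrev": ["AL", "AK", "AZ"], "total": [10.0, 20.0, 30.0]})

# shape is an attribute (no parentheses): (number of rows, number of columns).
print(toy_df.shape)

# info() prints column names, non-null counts, and dtypes.
toy_df.info()

# describe() returns summary statistics; by default only numeric columns are included.
summary = toy_df.describe()
print(summary)
```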

### 1.6 To replace specific values with different values, use the `replace()` method. For practice, replace each of the values [12.8, 13.6, 14.1, 19.4, 21.4] in the 'total' column with 100 and print the result without changing the dataframe.
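As a hedged illustration of how `replace()` behaves with a list of values, here is a sketch on a made-up column (the values are hypothetical and chosen only to show the pattern):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy_df = pd.DataFrame({"total": [12.8, 18.1, 13.6, 22.4]})

# Passing a list replaces every matching value with the single new value.
replaced = toy_df["total"].replace([12.8, 13.6], 100)
print(replaced)

# The original column is unchanged because the result was not assigned back.
print(toy_df["total"])
```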

## Loading and Saving Data

### 2.1 Use the `getcwd()` function of the Python module `os` to see your current working directory. Relative file paths are resolved against this directory when you save and load data. To change it, use the `chdir()` function.
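A small sketch of both functions, assuming a temporary directory as the target so that nothing on your machine is changed permanently:

```python
import os
import tempfile

# Remember where we started so we can return afterwards.
original_dir = os.getcwd()
print(original_dir)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Switch into the temporary directory and confirm the change.
    os.chdir(tmp_dir)
    changed_dir = os.getcwd()
    print(changed_dir)
    # Switch back before the temporary directory is removed.
    os.chdir(original_dir)
```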

### 2.2 Before loading and saving data with pandas, let's get the dataset from seaborn first. Load the 'car_crashes' dataset with seaborn's `load_dataset` function.

### 2.3 Now use the pandas dataframe method `to_csv` to save the dataframe to a CSV file. Make sure the index is not saved, and use a space (' ') as the delimiter between values. If you are not sure how to do that, look it up in the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html).

### 2.4 Now load your saved file again with `pd.read_csv()` and append `_csv` to the variable name. Remember to load it with the same delimiter you used for saving. Print the dataframe afterwards to check that it worked.
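The save-and-reload round trip can be sketched as follows. The dataframe, the file name `toy.csv`, and the temporary directory are assumptions made for illustration; the relevant parameters are `sep` and `index`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data, used only for illustration.
toy_df = pd.DataFrame({"abbrev": ["AL", "AK"], "total": [18.8, 18.1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")  # hypothetical file name

    # index=False leaves out the row index; sep=" " uses a space as delimiter.
    toy_df.to_csv(path, sep=" ", index=False)

    # The same delimiter must be passed when reading the file back.
    toy_df_csv = pd.read_csv(path, sep=" ")

print(toy_df_csv)
```

If the delimiter is omitted when reading, pandas assumes a comma and each row ends up in a single column.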

## Groupby

### As a first step, we add a column indicating which of the four regions 'Northeast', 'South', 'Midwest', or 'West' a state belongs to. After that, we can group the dataframe by those regions.

In [None]:
crash_df['US Region'] = ['South', 'West', 'West', 'South', 'West', 'West', 'Northeast', 'South', 'South', 'South', 'South', 'West', 'West', 'Midwest', 'Midwest', 'Midwest',
'Midwest', 'South', 'South', 'Northeast', 'South', 'Northeast', 'Midwest', 'Midwest', 'South', 'Midwest', 'West', 'Midwest', 'West', 'Northeast', 'Northeast', 'West', 'Northeast', 'South', 'Midwest', 'Midwest', 'South', 'West', 'Northeast', 'Northeast', 'South', 'Midwest',
'South', 'South', 'West', 'Northeast', 'South', 'West', 'South', 'Midwest', 'West']

### 3.1 To get familiar with the groupby method, first apply `groupby('US Region')` to the dataframe without calling any further method, and have a look at the type of the result.
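A minimal sketch of what a bare `groupby()` returns, using a made-up dataframe with the same column name (the values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy_df = pd.DataFrame({
    "US Region": ["South", "West", "South"],
    "speeding": [7.0, 8.0, 5.0],
})

# groupby() alone does not compute anything; it returns a grouping object.
grouped = toy_df.groupby("US Region")
print(type(grouped))

# The object knows how the rows are split, for example how many groups there are.
print(grouped.ngroups)
```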

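### 3.2 As you can see, applying groupby on its own does not get us far. Let us get something more interesting. Use groupby as before and call the `mean()` method on the result to get the mean of each column per US Region. In pandas 2.0 and later, pass `numeric_only=True` so that non-numeric columns such as 'abbrev' are skipped instead of raising an error.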
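The pattern of grouping followed by an aggregation can be sketched on made-up data. The `numeric_only=True` argument is included on the assumption that the dataframe contains a text column, as the crash data does; recent pandas versions raise an error when asked for the mean of such a column.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy_df = pd.DataFrame({
    "abbrev": ["AL", "AK", "AR"],
    "US Region": ["South", "West", "South"],
    "speeding": [7.0, 8.0, 5.0],
})

# One row per region, one column per numeric column of the original dataframe.
region_means = toy_df.groupby("US Region").mean(numeric_only=True)
print(region_means)
```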

### 3.3 Now get the maximum value of the 'speeding' column for each region.
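Selecting a single column from the grouped object before aggregating can be sketched like this (the data is made up for illustration):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy_df = pd.DataFrame({
    "US Region": ["South", "West", "South", "West"],
    "speeding": [7.0, 8.0, 5.0, 3.0],
})

# Indexing with one column name yields a Series with one value per region.
max_speeding = toy_df.groupby("US Region")["speeding"].max()
print(max_speeding)
```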

### 3.4 Using `describe`, you can also get summary statistics of the 'speeding' column for each region. Give it a try.
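On a grouped column, `describe()` returns one row of statistics per group. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy_df = pd.DataFrame({
    "US Region": ["South", "West", "South", "West"],
    "speeding": [7.0, 8.0, 5.0, 3.0],
})

# Rows are regions; columns are the individual summary statistics.
speeding_stats = toy_df.groupby("US Region")["speeding"].describe()
print(speeding_stats)
```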

### 3.5 You can also select several columns at once by passing a list of names instead of a single column name. Get the standard deviation (`std`) of 'total' and 'alcohol' per region.
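Selecting several columns works the same way as selecting one, except that the names go inside a list. A sketch on made-up data (the extra `speeding` column is there only to show that it is left out of the result):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy_df = pd.DataFrame({
    "US Region": ["South", "South", "West", "West"],
    "total": [10.0, 14.0, 20.0, 20.0],
    "alcohol": [2.0, 4.0, 5.0, 7.0],
    "speeding": [7.0, 5.0, 8.0, 3.0],
})

# Note the double brackets: the inner pair is the list of column names.
region_std = toy_df.groupby("US Region")[["total", "alcohol"]].std()
print(region_std)
```

The result is a dataframe with one row per region and one column per selected name. `std()` computes the sample standard deviation (it divides by n - 1) by default.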