# Exploratory Data Analysis, Part 1

In this notebook, you will learn some techniques to explore a dataset. The tools you will learn about are:
* Checking and Fixing Datatypes
* Recoding Categorical Variables using `.replace()`
* Computing Summary Statistics with `.describe()`
* Computing Aggregation by Category with `.groupby()`

## Step 1: Read in Data and Check Datatypes

In [30]:
import pandas as pd

In [31]:
cars = pd.read_csv('../data/auto-mpg.csv')

Take a peek at the data and then check the datatypes with `.info()`.

In [34]:
cars.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |


In [35]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car name      398 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB


**Question:** Does anything look amiss?

You can convert a column to a numeric datatype using the pandas `to_numeric` function. Let's try that with the horsepower column.

In [38]:
pd.to_numeric(cars['horsepower']) 

ValueError: Unable to parse string "?" at position 32

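Something is not right, and the error message gives a clue: the value at position 32 is the string "?", which can't be parsed as a number. Because `cars` uses the default integer index, position 32 is also the row labeled 32, so you can access this row using `.loc`.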

In [40]:
cars.loc[32]

mpg                     25
cylinders                4
displacement            98
horsepower               ?
weight                2046
acceleration            19
model year              71
origin                   1
car name        ford pinto
Name: 32, dtype: object
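The error message reports a position, while `.loc` looks rows up by index label. The two coincide here only because of the default `RangeIndex`. The sketch below uses a made-up three-row DataFrame with a non-default index (the values and labels are assumptions for illustration) to show the difference between `.iloc` (position) and `.loc` (label).

```python
import pandas as pd

# Hypothetical toy data; the index labels 10, 11, 12 are chosen so that
# labels and positions do not coincide.
toy = pd.DataFrame({'horsepower': ['130', '?', '150']}, index=[10, 11, 12])

# Second row by position (positions start at 0).
by_position = toy.iloc[1]

# Row whose index label is 11, which is the same row in this toy example.
by_label = toy.loc[11]

print(by_position)
print(by_label)
```

If a DataFrame has been filtered or re-indexed, `.iloc` is the safer way to follow up on a position reported in an error message.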

You can tell the `to_numeric` function what to do with values it can't convert by using the `errors` argument. Here, `errors='coerce'` puts a NaN (not a number) in every position that can't be converted.

In [41]:
cars['horsepower'] = pd.to_numeric(cars['horsepower'], errors = 'coerce')
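To see what `errors='coerce'` does in isolation, here is a minimal sketch on a made-up Series of strings (the values are assumptions, not taken from the dataset). It also shows one way to list the original strings that could not be converted.

```python
import pandas as pd

# Hypothetical values standing in for a column that was read in as strings.
raw = pd.Series(['130', '165', '?', '150'])

# Unparseable entries become NaN, so the result has a float dtype.
converted = pd.to_numeric(raw, errors='coerce')
print(converted)

# Use the NaN positions as a mask to recover the strings that failed to convert.
unparsed = raw[converted.isna()]
print(unparsed.unique())
```

Looking at the unparsed strings before overwriting a column is a quick way to confirm that only placeholder values, such as "?", are being turned into NaN. In the cell above, the converted column has already been assigned back to `cars['horsepower']`.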

Now, check that the changes stuck.

In [42]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car name      398 non-null    object 
dtypes: float64(4), int64(4), object(1)
memory usage: 28.1+ KB


**Question:** Does anything look strange now?

If you want to inspect rows with missing values, you can use the `.isna()` method in conjunction with `.loc`.

In [43]:
cars.loc[cars['horsepower'].isna()]

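| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|---|---|---|---|---|---|---|---|---|
| 32 | 25.0 | 4 | 98.0 | NaN | 2046 | 19.0 | 71 | 1 | ford pinto |
| 126 | 21.0 | 6 | 200.0 | NaN | 2875 | 17.0 | 74 | 1 | ford maverick |
| 330 | 40.9 | 4 | 85.0 | NaN | 1835 | 17.3 | 80 | 2 | renault lecar deluxe |
| 336 | 23.6 | 4 | 140.0 | NaN | 2905 | 14.3 | 80 | 1 | ford mustang cobra |
| 354 | 34.5 | 4 | 100.0 | NaN | 2320 | 15.8 | 81 | 2 | renault 18i |
| 374 | 23.0 | 4 | 151.0 | NaN | 3035 | 20.5 | 82 | 1 | amc concord dl |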
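If you already know which placeholder a file uses for missing data, an alternative is to declare it when reading the file, using the `na_values` argument of `pd.read_csv`. The sketch below reads a small made-up CSV from a string (the two columns and their values are assumptions for illustration) and then counts the missing values in each column.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a real file on disk.
csv_text = "mpg,horsepower\n18.0,130\n25.0,?\n21.0,?\n16.0,150\n"

# Treat "?" as missing while reading, so horsepower is parsed as a float column.
sample = pd.read_csv(io.StringIO(csv_text), na_values='?')

print(sample.dtypes)

# Number of missing values in each column.
print(sample.isna().sum())
```

This avoids the separate conversion step, but it only works once you have discovered the placeholder, which is what the steps above were for.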


## Step 2: Replace Origin

You may have noticed that the origin column contains numeric values.

In [44]:
cars['origin'].value_counts()

1    249
3     79
2     70
Name: origin, dtype: int64

It turns out that this is a categorical variable which is coded using integers. You can replace the origin codes with what those codes stand for by using the `.replace` method. 

One way to use this method is to pass in a **dictionary**, which is a collection of key-value pairs. Dictionaries in Python can be created by listing `key: value` pairs inside curly braces `{ }`.

In [45]:
origin_lookup = {1: 'American', 2: 'European', 3: 'Japanese'}

In [46]:
cars['origin'] = cars['origin'].replace(origin_lookup)

In [47]:
cars.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | American | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | American | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | American | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | American | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | American | ford torino |
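A related method is `.map()`, which also accepts a dictionary. The two differ in how they treat values that are missing from the dictionary, which matters if a column contains an unexpected code. The sketch below uses a made-up Series in which the code 4 is an assumption added purely to show the difference.

```python
import pandas as pd

# Hypothetical codes; 4 has no entry in the lookup dictionary.
codes = pd.Series([1, 3, 2, 4])
lookup = {1: 'American', 2: 'European', 3: 'Japanese'}

# .replace() leaves values without a matching key unchanged.
replaced = codes.replace(lookup)

# .map() turns values without a matching key into NaN.
mapped = codes.map(lookup)

print(replaced.tolist())
print(mapped.tolist())
```

With `.map()`, an unexpected code shows up as a missing value that `.isna()` can find, while `.replace()` keeps the original code in place.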


## Step 3: Summary Statistics

Now, you can start investigating the data further. Pandas has a lot of built-in functionality for computing summary statistics. For example, the `.describe()` method reports the count, mean, standard deviation, and five-number summary (minimum, quartiles, and maximum) of a column.

In [48]:
cars['mpg'].describe()

count    398.000000
mean      23.514573
std        7.815984
min        9.000000
25%       17.500000
50%       23.000000
75%       29.000000
max       46.600000
Name: mpg, dtype: float64
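Each statistic reported by `.describe()` is also available as its own method, and `.describe()` can be called on a whole DataFrame to summarize every numeric column at once. Here is a small sketch on made-up data (the four rows are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data with two numeric columns.
toy = pd.DataFrame({
    'mpg': [18.0, 15.0, 25.0, 30.0],
    'weight': [3504, 3693, 2046, 1835],
})

# One column of summary statistics per numeric column.
summary = toy.describe()
print(summary)

# Individual statistics for a single column.
mpg_mean = toy['mpg'].mean()
mpg_median = toy['mpg'].median()
mpg_q1 = toy['mpg'].quantile(0.25)  # first quartile
print(mpg_mean, mpg_median, mpg_q1)
```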

## Step 4: Aggregation by Group

Perhaps you suspect that a numeric variable is distributed differently across the levels of a categorical variable.

In `pandas`, the usual way to compare a numeric variable across the levels of a categorical variable is `.groupby()`. To use it, you specify the column(s) to group on, followed by the column you want to aggregate, and finally the aggregation to apply.

Let's say you want to see how the average mpg varies based on origin.
* groupby: origin
* column to aggregate: mpg
* aggregation: mean

In [49]:
cars.groupby('origin')['mpg'].mean()

origin
American    20.083534
European    27.891429
Japanese    30.450633
Name: mpg, dtype: float64
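The same three-part pattern extends to several aggregations at once by ending the chain with `.agg()` and a list of aggregation names. The sketch below uses a made-up five-row DataFrame (the values are assumptions, not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'origin': ['American', 'American', 'Japanese', 'Japanese', 'European'],
    'mpg': [18.0, 16.0, 31.0, 33.0, 26.0],
})

# Group on origin, select mpg, then compute several statistics per group.
summary = toy.groupby('origin')['mpg'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```

Including `count` alongside the mean is a useful habit, since a group mean based on only a few rows is less reliable.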

**Question:** How did average horsepower change across years?

In [50]:
# Your Code Here

Finally, what if you want to perform an aggregation but then do something further with the result? First, repeat the aggregation done above but save the result to a variable.

In [56]:
origin_mpg = cars.groupby('origin')['mpg'].mean()

What type of object is `origin_mpg`?

In [57]:
type(origin_mpg)

pandas.core.series.Series

You can convert `origin_mpg` to a DataFrame by calling the `.reset_index()` method on it. The group labels move out of the index and into a regular column.

In [58]:
origin_mpg = origin_mpg.reset_index()
origin_mpg

| | origin | mpg |
|---|---|---|
| 0 | American | 20.083534 |
| 1 | European | 27.891429 |
| 2 | Japanese | 30.450633 |
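Once the result is a DataFrame, you can keep working with it like any other, for example by sorting it. The sketch below does this on made-up data (the values are assumptions for illustration), and also shows `as_index=False`, an option of `.groupby()` that returns the group labels as a regular column without a separate `.reset_index()` call.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'origin': ['American', 'American', 'Japanese', 'Japanese', 'European'],
    'mpg': [18.0, 16.0, 31.0, 33.0, 26.0],
})

# Aggregate, convert to a DataFrame, then sort from highest to lowest mean mpg.
by_origin = toy.groupby('origin')['mpg'].mean().reset_index()
ranked = by_origin.sort_values('mpg', ascending=False)
print(ranked)

# Alternative: keep origin as a column from the start.
direct = toy.groupby('origin', as_index=False)['mpg'].mean()
print(direct)
```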
