# Recent Graduates Project Overview
**In this project, we'll explore how using the pandas plotting functionality along with the Jupyter notebook interface allows us to explore data quickly using visualizations.**

Using visualizations, we can start to explore questions from the dataset like:

- Do students in more popular majors make more money?
    - Using scatter plots
- How many majors are predominantly male? Predominantly female?
    - Using histograms
- Which category of majors has the most students?
    - Using bar plots
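
As a rough sketch of how each question maps to a pandas plot type, the block below draws all three plots from four rows copied from the dataset preview in this notebook. The `sample` frame, the variable names, and the `Agg` backend line are assumptions made for illustration; in the notebook itself you would plot the full `recent_grads` frame instead.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Four rows copied from the dataset preview in this notebook
sample = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Business", "Physical Sciences"],
    "Total": [2339.0, 32260.0, 3777.0, 1792.0],
    "ShareWomen": [0.120564, 0.341631, 0.441356, 0.535714],
    "Median": [110000, 65000, 62000, 62000],
})

fig, (ax_scatter, ax_hist, ax_bar) = plt.subplots(1, 3, figsize=(15, 4))

# Scatter plot: popularity (Total) against earnings (Median)
sample.plot(x="Total", y="Median", kind="scatter", ax=ax_scatter)

# Histogram: distribution of the share of women per major
sample["ShareWomen"].plot(kind="hist", bins=4, ax=ax_hist)

# Bar plot: number of students per major category
totals_by_category = sample.groupby("Major_category")["Total"].sum()
totals_by_category.plot(kind="bar", ax=ax_bar)
```

Passing an explicit `ax` to each call keeps the three plots on separate axes instead of drawing them on top of one another.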


## Summary of Results

# Environment Setup

## Loading Dependencies

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import os
import git
import re

from pathlib import Path

# Display all rows (None disables the row limit)
pd.set_option('display.max_rows', None)

In [2]:
# Jupyter magic to display plots inline
%matplotlib inline

## Importing our Recent Graduate Data
We'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the [American Community Survey](https://www.census.gov/programs-surveys/acs/), which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it in their [GitHub repo](https://github.com/fivethirtyeight/data/tree/master/college-majors).

Each row in the dataset represents a different major in college and contains information on gender diversity, employment rates, median salaries, and more. Here are some of the columns in the dataset:

- `Rank` - Rank by median earnings (the dataset is ordered by this column).
- `Major_code` - Major code.
- `Major` - Major description.
- `Major_category` - Category of major.
- `Total` - Total number of people with major.
- `Sample_size` - Sample size (unweighted) of full-time, year-round workers.
- `Men` - Male graduates.
- `Women` - Female graduates.
- `ShareWomen` - Women as share of total.
- `Employed` - Number employed.
- `Median` - Median salary of full-time, year-round workers.
- `Low_wage_jobs` - Number in low-wage service jobs.
- `Full_time` - Number employed 35 hours or more.
- `Part_time` - Number employed less than 35 hours.
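
To make the relationship between a few of these columns concrete, here is a small check that `ShareWomen` equals `Women / Total`. The three rows are copied from the dataset preview in this notebook; the `share_women` helper is a hypothetical name introduced only for this sketch.

```python
# (Major, Total, Men, Women, ShareWomen) copied from the dataset preview
rows = [
    ("PETROLEUM ENGINEERING", 2339.0, 2057.0, 282.0, 0.120564),
    ("MINING AND MINERAL ENGINEERING", 756.0, 679.0, 77.0, 0.101852),
    ("METALLURGICAL ENGINEERING", 856.0, 725.0, 131.0, 0.153037),
]

def share_women(total, women):
    """Women as a share of total graduates (hypothetical helper)."""
    return women / total

computed_shares = []
for major, total, men, women, reported in rows:
    # In these rows, Men and Women add up to Total
    assert men + women == total
    computed = share_women(total, women)
    # The preview shows six decimal places, so compare at that precision
    assert abs(computed - reported) < 5e-7
    computed_shares.append(computed)
    print(f"{major}: {computed:.6f}")
```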

Now that we have an overview of the data we'll be viewing, let's read in the dataset and see what it looks like!

In [3]:
cols = ['Rank', 'Major_code', 'Major', 'Total', 'Men', 'Women',
       'Major_category', 'ShareWomen', 'Sample_size', 'Employed', 'Full_time',
       'Part_time', 'Full_time_year_round', 'Unemployed', 'Unemployment_rate',
       'Median', 'P25th', 'P75th', 'College_jobs', 'Non_college_jobs',
       'Low_wage_jobs']

pd.read_csv('/Users/matthew.i.quinn/Desktop/recent-grads.csv', usecols=cols)

ValueError: max() arg is an empty sequence

| Unnamed: 0 | Rank | Major_code | Major | Total | Men | Women | Major_category | ShareWomen | Sample_size | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2419 | PETROLEUM ENGINEERING | 2339.0 | 2057.0 | 282.0 | Engineering | 0.120564 | 36 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
| 1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | 756.0 | 679.0 | 77.0 | Engineering | 0.101852 | 7 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
| 2 | 3 | 2415 | METALLURGICAL ENGINEERING | 856.0 | 725.0 | 131.0 | Engineering | 0.153037 | 3 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
| 3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | 1258.0 | 1123.0 | 135.0 | Engineering | 0.107313 | 16 | 758 | ... | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
| 4 | 5 | 2405 | CHEMICAL ENGINEERING | 32260.0 | 21239.0 | 11021.0 | Engineering | 0.341631 | 289 | 25694 | ... | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
| 5 | 6 | 2418 | NUCLEAR ENGINEERING | 2573.0 | 2200.0 | 373.0 | Engineering | 0.144967 | 17 | 1857 | ... | 264 | 1449 | 400 | 0.177226 | 65000 | 50000 | 102000 | 1142 | 657 | 244 |
| 6 | 7 | 6202 | ACTUARIAL SCIENCE | 3777.0 | 2110.0 | 1667.0 | Business | 0.441356 | 51 | 2912 | ... | 296 | 2482 | 308 | 0.095652 | 62000 | 53000 | 72000 | 1768 | 314 | 259 |
| 7 | 8 | 5001 | ASTRONOMY AND ASTROPHYSICS | 1792.0 | 832.0 | 960.0 | Physical Sciences | 0.535714 | 10 | 1526 | ... | 553 | 827 | 33 | 0.021167 | 62000 | 31500 | 109000 | 972 | 500 | 220 |
| 8 | 9 | 2414 | MECHANICAL ENGINEERING | 91227.0 | 80320.0 | 10907.0 | Engineering | 0.119559 | 1029 | 76442 | ... | 13101 | 54639 | 4650 | 0.057342 | 60000 | 48000 | 70000 | 52844 | 16384 | 3253 |
| 9 | 10 | 2408 | ELECTRICAL ENGINEERING | 81527.0 | 65511.0 | 16016.0 | Engineering | 0.19645 | 631 | 61928 | ... | 12695 | 41413 | 3895 | 0.059174 | 60000 | 45000 | 72000 | 45829 | 10874 | 3170 |


In [9]:
# Read in the data
repo_root = Path(git.Repo(os.getcwd(), search_parent_directories=True).git.rev_parse("--show-toplevel"))
file_name = 'recent-grads.csv'
file_path = repo_root / 'data' / file_name
recent_grads = pd.read_csv(file_path, delimiter=',')

# Quick exploration of the data
recent_grads



# Data Cleaning & Imputation
As we can see from the above, a few data points need to be cleaned up and imputed. Let's get started on that.

## Column Headers
The column names are in camel case, but Python convention favors snake case (`one_two` rather than `oneTwo`), so we need to convert them.

In [3]:
# Function to help us easily clean up our columns
def clean_col_headers(col):
    # Adding a '_' between camel case col and lowercasing it
    return re.sub(r"(\w)([A-Z])", r"\1_\2", col).lower()

In [4]:
# Creating a copy of our dataframe
autos_use = autos.copy()

# Cleaning up our columns
autos_use.columns = [clean_col_headers(col) for col in autos_use.columns]

# Viewing our cleaned up columns
autos_use.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'year_of_registration', 'gearbox', 'power_ps', 'model',
       'odometer', 'month_of_registration', 'fuel_type', 'brand',
       'not_repaired_damage', 'date_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
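
To see what `clean_col_headers` does to individual names, here is a small standalone check. The camel-case inputs are assumed originals inferred from the cleaned names above, not copied from the raw file.

```python
import re

def clean_col_headers(col):
    # Insert '_' before an uppercase letter that follows a word character, then lowercase
    return re.sub(r"(\w)([A-Z])", r"\1_\2", col).lower()

# Assumed camel-case inputs, chosen for illustration
examples = ["dateCrawled", "yearOfRegistration", "powerPS", "price"]
cleaned = [clean_col_headers(name) for name in examples]
print(cleaned)
# ['date_crawled', 'year_of_registration', 'power_ps', 'price']
```

Note that a run of capitals such as `PS` stays together, because the match consumes the `P` and leaves no preceding character for the `S` to pair with.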

## Converting `price` & `odometer`
The `price` and `odometer` columns are currently stored as strings, so we need to convert them to numeric values before we can perform any arithmetic on them.

In [5]:
# Taking a peek at these columns
autos_use[['price', 'odometer']].head(25)


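# Removes the given substrings from every value in a string column
def remove_string_characters(df, col='price', strings_to_replace=('$', ',', 'km')):
    for char in strings_to_replace:
        # regex=False treats each substring literally ('$' is a regex anchor otherwise)
        df[col] = df[col].str.replace(char, "", regex=False)
    return df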
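
As an alternative sketch, a single regular expression can strip all of the unwanted characters in one pass. The `raw` frame and its values are made up for illustration (they are not taken from the dataset), and `to_number` is a hypothetical helper name.

```python
import pandas as pd

# Made-up listing values for illustration; they are not taken from the dataset
raw = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$0"],
    "odometer": ["150,000km", "70,000km", "5,000km"],
})

def to_number(series):
    """Strip '$', ',' and a trailing 'km', then convert the strings to floats."""
    cleaned = series.str.replace(r"[$,]|km$", "", regex=True)
    return pd.to_numeric(cleaned).astype(float)

raw["price"] = to_number(raw["price"])
raw["odometer_km"] = to_number(raw["odometer"])
print(raw[["price", "odometer_km"]])
```

Inside the character class `[$,]` the dollar sign is a literal character, while `km$` only matches the unit at the end of the string.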

### `price`
`price` currently contains characters such as `$` and the comma separator. Let's remove them so the column can be converted to numeric values.

In [6]:
# Removing string characters from the price column
autos_use = remove_string_characters(autos_use, 'price')

# Converting price to a numeric value
autos_use.price = autos_use.price.astype(float)

In [7]:
# Quick exploration of the data
print('Data Shape:')
display(autos_use.price.unique().shape)
print('Data Descriptives:')
display(autos_use.price.describe().round(2))
print('Ascending Price Values:')
display(autos_use.price.value_counts().sort_index().head(10))
print('Descending Price Values:')
display(autos_use.price.value_counts().sort_index(ascending=False).head(10))

Data Shape:

(2357,)

Data Descriptives:

count       50000.00
mean         9840.04
std        481104.38
min             0.00
25%          1100.00
50%          2950.00
75%          7200.00
max      99999999.00
Name: price, Length: 8, dtype: float64

Ascending Price Values:

0.0     1421
1.0      156
2.0        3
3.0        1
5.0        2
8.0        1
9.0        1
10.0       7
11.0       2
12.0       3
Name: price, Length: 10, dtype: int64

Descending Price Values:

99999999.0    1
27322222.0    1
12345678.0    3
11111111.0    2
10000000.0    1
3890000.0     1
1300000.0     1
1234566.0     1
999999.0      2
999990.0      1
Name: price, Length: 10, dtype: int64

### `odometer`
Similar to `price`, the values in the odometer column contain string characters. Let's go ahead and remove them!

In [8]:
# Removing string characters from the odometer column
autos_use = remove_string_characters(autos_use, 'odometer')

# Converting odometer to a numeric value
autos_use.odometer = autos_use.odometer.astype(float)

# Renaming the odometer column to odometer_km
autos_use.rename({'odometer':'odometer_km'}, axis=1, inplace=True)

In [9]:
# Quick exploration of the data
print('Data Shape:')
display(autos_use.odometer_km.unique().shape)
print('Data Descriptives:')
display(autos_use.odometer_km.describe().round(2))
print('Ascending Odometer Values:')
display(autos_use.odometer_km.value_counts().sort_index().head(10))
print('Descending Odometer Values:')
display(autos_use.odometer_km.value_counts().sort_index(ascending=False).head(10))

Data Shape:

(13,)

Data Descriptives:

count     50000.00
mean     125732.70
std       40042.21
min        5000.00
25%      125000.00
50%      150000.00
75%      150000.00
max      150000.00
Name: odometer_km, Length: 8, dtype: float64

Ascending Odometer Values:

5000.0      967
10000.0     264
20000.0     784
30000.0     789
40000.0     819
50000.0    1027
60000.0    1164
70000.0    1230
80000.0    1436
90000.0    1757
Name: odometer_km, Length: 10, dtype: int64

Descending Odometer Values:

150000.0    32424
125000.0     5170
100000.0     2169
90000.0      1757
80000.0      1436
70000.0      1230
60000.0      1164
50000.0      1027
40000.0       819
30000.0       789
Name: odometer_km, Length: 10, dtype: int64

# Future Considerations
In this project, we practiced applying a variety of pandas methods to explore and understand a dataset on car listings. Here are some next steps for you to consider:

Data cleaning next steps:

- Identify categorical data that uses German words, translate them, and map the values to their English counterparts.
- Convert the dates to be uniform numeric data, so "2016-03-21" becomes the integer `20160321`.
- See if there are particular keywords in the `name` column that you can extract as new columns.
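
The date conversion mentioned above could be sketched as follows. The `date_to_int` helper is a hypothetical name, and the handling of a trailing time component is an assumption about how the raw date strings might look.

```python
def date_to_int(date_string):
    """Turn a string that starts with 'YYYY-MM-DD' into the integer YYYYMMDD."""
    # Keep the first ten characters so a trailing time component is ignored
    return int(date_string[:10].replace("-", ""))

print(date_to_int("2016-03-21"))           # 20160321
print(date_to_int("2016-03-21 00:00:00"))  # 20160321
```

On a pandas column, the same idea can be applied with `Series.str[:10]`, `Series.str.replace`, and `astype(int)`.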

Analysis next steps:

- Find the most common brand/model combinations
- Split `odometer_km` into groups, and use aggregation to see whether average prices follow any pattern based on mileage.
- How much cheaper are cars with damage than their non-damaged counterparts?
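
One possible way to approach the mileage grouping is `pd.cut` followed by a `groupby`. This is only a sketch: the `listings` frame is made up, and the bin edges and labels are hypothetical choices rather than values derived from the real data.

```python
import pandas as pd

# Made-up listings for illustration only
listings = pd.DataFrame({
    "odometer_km": [5000, 40000, 90000, 125000, 150000, 150000],
    "price": [20000.0, 12000.0, 8000.0, 5000.0, 3000.0, 2000.0],
})

# Hypothetical bin edges; pick edges that suit the real distribution
bins = [0, 50000, 100000, 150000]
labels = ["0-50k", "50k-100k", "100k-150k"]
listings["odometer_group"] = pd.cut(listings["odometer_km"], bins=bins, labels=labels)

# Average price per mileage group
mean_price = listings.groupby("odometer_group", observed=True)["price"].mean()
print(mean_price)
```

By default `pd.cut` uses right-closed intervals, so a car with exactly 50,000 km falls into the `0-50k` group.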