Note: Your code/analysis should be self-explanatory with necessary comments.

# Question 1

**Use the dataset fatalities along with the dataset description**

- **Analyze the dataset and identify any missing values, outliers, or other issues.**
- **Normalize the numerical features in the dataset if necessary, by using methods such as min-max normalization or z-score normalization**
- **Discretize any continuous features in the dataset if necessary, by using methods such as equal width binning or equal frequency binning**
- **Fill in any missing values in the dataset, by using methods such as mean imputation, median imputation, or regression imputation (optional)**

In [1]:
import pandas as pd
import numpy as np
fatal = pd.read_csv('Fatalities.csv')

In [2]:
fatal.drop(columns = ['Unnamed: 0'],inplace=True)

In [3]:
fatal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336 entries, 0 to 335
Data columns (total 34 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   state         336 non-null    object 
 1   year          336 non-null    int64  
 2   spirits       336 non-null    float64
 3   unemp         336 non-null    float64
 4   income        336 non-null    float64
 5   emppop        336 non-null    float64
 6   beertax       336 non-null    float64
 7   baptist       336 non-null    float64
 8   mormon        336 non-null    float64
 9   drinkage      336 non-null    float64
 10  dry           336 non-null    float64
 11  youngdrivers  336 non-null    float64
 12  miles         336 non-null    float64
 13  breath        336 non-null    object 
 14  jail          335 non-null    object 
 15  service       335 non-null    object 
 16  fatal         336 non-null    int64  
 17  nfatal        336 non-null    int64  
 18  sfatal        336 non-null    

> So the 'fatalities' dataset is almost complete: of the columns listed above, only `jail` and `service` have a missing value (335 non-null out of 336 each).
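The `info()` counts can be turned into an explicit list of incomplete columns. The sketch below uses a small made-up frame (the values are placeholders, not rows from `Fatalities.csv`) with one gap in each of `jail` and `service`. Mode imputation is shown as one possible fill for categorical columns, not as the required method.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: placeholder values, with one gap in each
# of `jail` and `service` to mirror the 335/336 non-null counts.
toy = pd.DataFrame({
    "state": ["aa", "aa", "bb", "bb"],
    "jail": ["no", "no", np.nan, "yes"],
    "service": ["no", "no", np.nan, "yes"],
    "fatal": [10, 20, 30, 40],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)

# Show the rows that contain at least one missing value.
rows_with_gaps = toy[toy.isna().any(axis=1)]
print(rows_with_gaps)

# One possible fill for categorical columns: the most frequent value (mode).
filled = toy.copy()
for col in missing_counts.index:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
print(filled)
```

On the real data the same two steps would be `fatal.isna().sum()` followed by `fatal[fatal.isna().any(axis=1)]`.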

In [4]:
fatal.describe()

| | year | spirits | unemp | income | emppop | beertax | baptist | mormon | drinkage | dry | ... | nfatal2124 | afatal | pop | pop1517 | pop1820 | pop2124 | milestot | unempus | emppopus | gsp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 336.0 | 336.0 | 336.0 | 336.0 | 336.0 | 336.0 | 336.0 | 336.0 | 336.0 | 336.0 | ... | 336.0 | 336.0 | 336.0 | 336.0 | 336.0 | 336.0 | 336.0 | 336.0 | 336.0 | 336.0 |
| mean | 1985.0 | 1.75369 | 7.346726 | 13880.184533 | 60.805676 | 0.513256 | 7.156925 | 2.801933 | 20.455625 | 4.267074 | ... | 41.377976 | 293.333247 | 4930272.0 | 230815.5 | 249090.4 | 336389.9 | 37101.491151 | 7.528571 | 59.97143 | 0.025313 |
| std | 2.002983 | 0.683575 | 2.533405 | 2253.046291 | 4.721656 | 0.477844 | 9.762621 | 9.665279 | 0.899025 | 9.500901 | ... | 42.930315 | 303.580749 | 5073704.0 | 229896.3 | 249345.6 | 345304.4 | 37454.365758 | 1.479376 | 1.585048 | 0.043173 |
| min | 1982.0 | 0.79 | 2.4 | 9513.761719 | 42.993198 | 0.043311 | 0.0 | 0.1 | 18.0 | 0.0 | ... | 1.0 | 24.6 | 478999.7 | 21000.02 | 20999.96 | 30000.16 | 3993.0 | 5.5 | 57.799999 | -0.123641 |
| 25% | 1983.0 | 1.3 | 5.475 | 12085.849854 | 57.691426 | 0.208849 | 0.626752 | 0.27216 | 20.0 | 0.0 | ... | 13.0 | 90.497749 | 1545251.0 | 71749.93 | 76962.12 | 103500.0 | 11691.500244 | 6.2 | 57.900002 | 0.001182 |
| 50% | 1985.0 | 1.67 | 7.0 | 13763.128906 | 61.36466 | 0.352589 | 1.74925 | 0.393111 | 21.0 | 0.086812 | ... | 30.0 | 211.594002 | 3310503.0 | 163000.2 | 170982.3 | 240999.9 | 28483.5 | 7.2 | 60.100002 | 0.032413 |
| 75% | 1987.0 | 2.0125 | 8.9 | 15175.124268 | 64.412504 | 0.651573 | 13.127125 | 0.62932 | 21.0 | 2.42481 | ... | 49.0 | 363.957748 | 5751735.0 | 270500.2 | 308311.4 | 413000.1 | 44139.75 | 9.6 | 61.5 | 0.056501 |
| max | 1988.0 | 4.9 | 18.0 | 22193.455078 | 71.268654 | 2.720764 | 30.3557 | 65.916496 | 21.0 | 45.792099 | ... | 249.0 | 2094.899902 | 28314030.0 | 1172000.0 | 1321004.0 | 1892998.0 | 241575.015625 | 9.7 | 62.300003 | 0.142361 |
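The summary statistics hint at heavy right tails. For example, `mormon` has a 75th percentile of about 0.63 but a maximum of about 65.9. One common way to flag such values is the 1.5 × IQR rule, sketched below. The helper name `iqr_outlier_mask` and the toy series are illustrative assumptions, and the 1.5 multiplier is a convention rather than something the assignment prescribes.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)


# Toy values shaped like a skewed percentage column (not the real data).
toy = pd.Series([0.1, 0.3, 0.4, 0.6, 0.3, 0.5, 65.9], name="toy_share")

mask = iqr_outlier_mask(toy)
print(toy[mask])  # the values flagged as outliers
```

Applied to the real frame, the mask could be computed per numeric column, e.g. `fatal.select_dtypes("number").apply(iqr_outlier_mask).sum()` to count flagged rows in each column.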


In [5]:
fatal.year.value_counts()

1982    48
1983    48
1984    48
1985    48
1986    48
1987    48
1988    48
Name: year, dtype: int64

In [6]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler     # z-score normalization


Based on the fatalities dataset description sheet, the following columns seem most likely to benefit from each transformation:

## Z-score normalization: 

- `unemp`: Unemployment rate

- `income`: Per capita personal income in 1987 dollars

- `emppop`: Employment/population ratio

- `unempus`: US unemployment rate

- `emppopus`: US employment/population ratio

## Min-Max normalization: 

fatality-related columns: `['fatal', 'nfatal', 'sfatal', 'fatal1517', 'nfatal1517', 'fatal1820', 'nfatal1820', 'fatal2124', 'nfatal2124', 'afatal']`

population-related columns: `['pop', 'pop1517', 'pop1820', 'pop2124']`

## Discretization:

`state`: could be discretized into regions of the US (west coast, east coast, etc.)

`drinkage` (minimum legal drinking age): could be discretized into below, at, or above 21 years old. Since the maximum value in the data is 21, the "above" bin will be empty.
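The assignment also mentions equal-width and equal-frequency binning for continuous features. A minimal sketch of both, assuming a made-up income-like series and hypothetical bin labels: `pd.cut` splits the value range into equally wide intervals, while `pd.qcut` puts roughly the same number of rows in each bin.

```python
import pandas as pd

# Made-up income-like values (not taken from the dataset).
income = pd.Series(
    [9500, 11000, 12000, 13500, 14000, 15000, 17000, 22000], name="toy_income"
)

# Equal-width binning: 4 intervals that each span the same range of values.
equal_width = pd.cut(income, bins=4, labels=["low", "mid-low", "mid-high", "high"])

# Equal-frequency binning: 4 intervals that each hold about the same number of rows.
equal_freq = pd.qcut(income, q=4, labels=["q1", "q2", "q3", "q4"])

print(pd.DataFrame({"income": income, "equal_width": equal_width, "equal_freq": equal_freq}))
```

With skewed data the two methods differ noticeably: equal-width bins leave the upper intervals sparsely populated, while equal-frequency bins stay balanced at the cost of uneven interval widths.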

In [7]:
# z-score scaling 
z_col = ['unemp','income','emppop','unempus','emppopus']
z_df = fatal[z_col]
zscale = StandardScaler()
fatal_zscale=pd.DataFrame(zscale.fit_transform(z_df),columns=z_df.columns)
fatal_zscale.describe()

| | unemp | income | emppop | unempus | emppopus |
|---|---|---|---|---|---|
| count | 336.0 | 336.0 | 336.0 | 336.0 | 336.0 |
| mean | -2.167578e-16 | -2.061843e-16 | 3.383537e-16 | 3.515706e-16 | 6.378496e-15 |
| std | 1.001491 | 1.001491 | 1.001491 | 1.001491 | 1.001491 |
| min | -1.955512 | -1.940899 | -3.778133 | -1.373279 | -1.37199 |
| 25% | -0.7399204 | -0.7975916 | -0.6605509 | -0.8994013 | -1.308805 |
| 50% | -0.1370659 | -0.05203187 | 0.118564 | -0.2224327 | 0.08123663 |
| 75% | 0.6140314 | 0.5756078 | 0.7650299 | 1.402292 | 0.9658066 |
| max | 4.211393 | 3.695294 | 2.21926 | 1.469989 | 1.471278 |
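The `std` row shows 1.001491 rather than exactly 1. This is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). With 336 rows the ratio is sqrt(336/335) ≈ 1.001491. The sketch below reproduces this with a manual z-score on random toy data; the series is an assumption, only the row count matches the dataset.

```python
import numpy as np
import pandas as pd

n = 336  # same number of rows as the fatalities data
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(loc=7.3, scale=2.5, size=n), name="toy_unemp")

# Manual z-score using the population standard deviation (ddof=0),
# which is the convention StandardScaler follows.
z = (x - x.mean()) / x.std(ddof=0)

print(z.std(ddof=0))         # population std of the scaled data
print(z.std())               # sample std (ddof=1), what describe() reports
print(np.sqrt(n / (n - 1)))  # the expected ratio between the two
```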


In [8]:
# min-max scaling of fatality counts
f_col = ['fatal', 'nfatal', 'sfatal', 'fatal1517', 'nfatal1517', 'fatal1820', 'nfatal1820',
         'fatal2124', 'nfatal2124', 'afatal']
mm_df = fatal[f_col]
mm_scale = MinMaxScaler()
fatal_mm_scale = pd.DataFrame(mm_scale.fit_transform(mm_df),columns=mm_df.columns)
fatal_mm_scale.describe()

| | fatal | nfatal | sfatal | fatal1517 | nfatal1517 | fatal1820 | nfatal1820 | fatal2124 | nfatal2124 | afatal |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 336.0 | 336.0 | 336.0 | 336.0 | 336.0 | 336.0 | 336.0 | 336.0 | 336.0 | 336.0 |
| mean | 0.15662 | 0.16369 | 0.171344 | 0.189238 | 0.161341 | 0.167779 | 0.171055 | 0.151546 | 0.162814 | 0.129804 |
| std | 0.172175 | 0.181883 | 0.18242 | 0.176918 | 0.161229 | 0.175461 | 0.169583 | 0.173864 | 0.173106 | 0.146636 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.039585 | 0.039334 | 0.045378 | 0.072222 | 0.052632 | 0.052189 | 0.056122 | 0.039578 | 0.048387 | 0.03183 |
| 50% | 0.114654 | 0.117761 | 0.122689 | 0.146032 | 0.131579 | 0.126263 | 0.122449 | 0.112797 | 0.116935 | 0.090322 |
| 75% | 0.181475 | 0.192085 | 0.206723 | 0.234921 | 0.200658 | 0.207492 | 0.22449 | 0.182718 | 0.193548 | 0.163917 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
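In the scaled output, the 75th percentile of `fatal` is only about 0.18 even though the maximum is 1. A few very large values therefore stretch the range and compress most rows toward 0. The sketch below applies the min-max formula `(x - min) / (max - min)` by hand to made-up counts (the column names and values are placeholders) to show the same effect.

```python
import pandas as pd

# Made-up counts with one very large value per the skew seen in the real data.
counts = pd.DataFrame({
    "toy_fatal": [100, 200, 400, 800, 5000],
    "toy_pop": [1_000, 2_000, 3_000, 4_000, 5_000],
})

# Min-max normalization, column by column: (x - min) / (max - min).
scaled = (counts - counts.min()) / (counts.max() - counts.min())
print(scaled)

# The evenly spread column has a median of 0.5; the skewed one sits near 0.
print(scaled.median())
```

If that compression is a problem for a later model, a log transform before scaling is one option worth considering, though whether it is appropriate depends on how the features are used.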


In [9]:
# min-max scaling of population counts
pop_col = ['pop', 'pop1517', 'pop1820', 'pop2124']
mm_df2 = fatal[pop_col]
mm_scale = MinMaxScaler()
fatal_mm_scale2 = pd.DataFrame(mm_scale.fit_transform(mm_df2),columns=mm_df2.columns)
fatal_mm_scale2.describe()

| | pop | pop1517 | pop1820 | pop2124 |
|---|---|---|---|---|
| count | 336.0 | 336.0 | 336.0 | 336.0 |
| mean | 0.159916 | 0.18229 | 0.175454 | 0.164461 |
| std | 0.182278 | 0.199736 | 0.191804 | 0.185349 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.038306 | 0.044092 | 0.043048 | 0.039452 |
| 50% | 0.101724 | 0.123371 | 0.115371 | 0.113258 |
| 75% | 0.189428 | 0.216768 | 0.221008 | 0.205583 |
| max | 1.0 | 1.0 | 1.0 | 1.0 |


In [10]:
fatal.state.value_counts

<bound method IndexOpsMixin.value_counts of 0      al
1      al
2      al
3      al
4      al
       ..
331    wy
332    wy
333    wy
334    wy
335    wy
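Name: state, Length: 336, dtype: object>

Without the call parentheses, `fatal.state.value_counts` only displays the bound method, which is why the raw Series is echoed above. `fatal.state.value_counts()` returns the number of rows per state.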

In [11]:
# discretization (attempt 1: states => regions)
west = ['ak','wa','or','ca','nv','id','ut','az','mt','wy','co','nm']
midwest = ['nd','sd','ne','ks','mn','ia','mo','wi','il','in','mi','oh']
south = ['tx','ok','ar','la','ms','ky','tn','al','ga','fl','wv','md','va','nc','sc','de']
northeast = ['pa','ny','nj','vt','nh','ma','ct','ri','me']

In [13]:

conditions = [any(fatal['state'] in west), 
            any(fatal['state'] in midwest),
            any(fatal['state'] in south),
            any(fatal['state'] in northeast)]

regions = ['west coast','midwest','south','east coast']

fatal['region'] = np.select(conditions,regions)

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

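I couldn't get the state-to-region discretization to work as written. The expression `fatal['state'] in west` asks for a single truth value for the whole Series, which pandas refuses to give. The membership test has to be element-wise, e.g. `fatal['state'].isin(west)`, so that each condition passed to `np.select` is a boolean Series with one entry per row.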
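A working version of the region assignment can be sketched with `Series.isin`, reusing the state lists from the earlier cell. The toy frame and the `"other"` fallback label are assumptions for illustration. A dictionary lookup with `Series.map` is shown as an equivalent alternative.

```python
import numpy as np
import pandas as pd

west = ['ak', 'wa', 'or', 'ca', 'nv', 'id', 'ut', 'az', 'mt', 'wy', 'co', 'nm']
midwest = ['nd', 'sd', 'ne', 'ks', 'mn', 'ia', 'mo', 'wi', 'il', 'in', 'mi', 'oh']
south = ['tx', 'ok', 'ar', 'la', 'ms', 'ky', 'tn', 'al', 'ga', 'fl', 'wv', 'md',
         'va', 'nc', 'sc', 'de']
northeast = ['pa', 'ny', 'nj', 'vt', 'nh', 'ma', 'ct', 'ri', 'me']

# Toy frame; 'dc' is included only to exercise the fallback label.
toy = pd.DataFrame({"state": ["al", "ca", "ny", "oh", "dc"]})

# Series.isin returns one boolean per row, which is what np.select expects.
conditions = [
    toy["state"].isin(west),
    toy["state"].isin(midwest),
    toy["state"].isin(south),
    toy["state"].isin(northeast),
]
regions = ["west coast", "midwest", "south", "east coast"]

# A string default keeps the result dtype consistent with the string labels.
toy["region"] = np.select(conditions, regions, default="other")

# Equivalent approach: build a state -> region lookup and map it.
state_to_region = {
    state: region
    for region, members in zip(regions, [west, midwest, south, northeast])
    for state in members
}
toy["region_via_map"] = toy["state"].map(state_to_region).fillna("other")

print(toy)
```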

In [14]:
# discretization attempt 2
conditions = [(fatal['drinkage'] == 21),
            (fatal['drinkage'] < 21),
            (fatal['drinkage'] > 21)]

values = ['same','below','above']

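# a string default matches the string labels (newer NumPy versions can reject the implicit default of 0)
fatal['drinkagerange'] = np.select(conditions, values, default='unknown')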

In [15]:
fatal.head()

| | state | year | spirits | unemp | income | emppop | beertax | baptist | mormon | drinkage | ... | afatal | pop | pop1517 | pop1820 | pop2124 | milestot | unempus | emppopus | gsp | drinkagerange |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | al | 1982 | 1.37 | 14.4 | 10544.152344 | 50.692039 | 1.539379 | 30.3557 | 0.32829 | 19.0 | ... | 309.437988 | 3942002.25 | 208999.59375 | 221553.4375 | 290000.0625 | 28516.0 | 9.7 | 57.799999 | -0.022125 | below |
| 1 | al | 1983 | 1.36 | 13.7 | 10732.797852 | 52.14703 | 1.788991 | 30.333599 | 0.34341 | 19.0 | ... | 341.834015 | 3960008.0 | 202000.078125 | 219125.46875 | 290000.15625 | 31032.0 | 9.6 | 57.900002 | 0.046558 | below |
| 2 | al | 1984 | 1.32 | 11.1 | 11108.791016 | 54.168087 | 1.714286 | 30.311501 | 0.35924 | 19.0 | ... | 304.872009 | 3988991.75 | 196999.96875 | 216724.09375 | 288000.15625 | 32961.0 | 7.5 | 59.500004 | 0.062798 | below |
| 3 | al | 1985 | 1.28 | 8.9 | 11332.626953 | 55.271137 | 1.652542 | 30.289499 | 0.37579 | 19.67 | ... | 276.742004 | 4021007.75 | 194999.734375 | 214349.03125 | 284000.3125 | 35091.0 | 7.2 | 60.100002 | 0.02749 | below |
| 4 | al | 1986 | 1.23 | 9.8 | 11661.506836 | 56.514496 | 1.609907 | 30.267401 | 0.39311 | 21.0 | ... | 360.716003 | 4049993.75 | 203999.890625 | 212000.0 | 263000.28125 | 36259.0 | 7.0 | 60.700001 | 0.032143 | same |


# Question 2

**You have collected the data for assisting students in job seeking (Assignment 01, Part B, Prompt 1) and understood the data as part of the Assignment. Apply the data preprocessing steps above again, when necessary. If you are not happy with this dataset, you can use the titanic dataset in seaborn's built-in datasets.**

In [16]:
jobs_df = pd.read_csv('JasmineKobayashiA02Positions.csv')

In [17]:
jobs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5519 entries, 0 to 5518
Data columns (total 18 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Agency                         5519 non-null   object 
 1   # Of Positions                 5519 non-null   int64  
 2   Business Title                 5519 non-null   object 
 3   Civil Service Title            5519 non-null   object 
 4   Title Classification           5519 non-null   object 
 5   Job Category                   5517 non-null   object 
 6   Full-Time/Part-Time indicator  5293 non-null   object 
 7   Career Level                   5517 non-null   object 
 8   Salary Range From              5519 non-null   float64
 9   Salary Range To                5519 non-null   float64
 10  Salary Frequency               5519 non-null   object 
 11  Work Location                  5519 non-null   object 
 12  Job Description                5519 non-null   o

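(There are no missing numeric values to fill in. Of the columns listed above, the gaps are in categorical columns: `Job Category` and `Career Level` each have 2 missing values, and `Full-Time/Part-Time indicator` has 226.)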
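For categorical gaps like these, mean or median imputation does not apply. One simple option is to fill them with an explicit placeholder label so the missingness stays visible. The sketch below uses a tiny made-up frame; the cell values and the `"Unknown"` label are assumptions, and only the column names come from the `info()` output.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame using column names from the jobs dataset.
toy_jobs = pd.DataFrame({
    "Job Category": ["Legal Affairs", np.nan, "Legal Affairs"],
    "Full-Time/Part-Time indicator": ["F", np.nan, "P"],
    "Salary Range From": [50000.0, 60000.0, 70000.0],
})

# Columns treated as categorical in this sketch.
categorical_cols = ["Job Category", "Full-Time/Part-Time indicator"]

# Fill categorical gaps with an explicit label instead of guessing a value.
filled = toy_jobs.copy()
filled[categorical_cols] = filled[categorical_cols].fillna("Unknown")

print(filled)
```

Filling with the most frequent value (the mode) is the other common choice; it hides the gap but avoids adding a new category.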

In [20]:
jobs_df.describe()

| | # Of Positions | Salary Range From | Salary Range To |
|---|---|---|---|
| count | 5519.0 | 5519.0 | 5519.0 |
| mean | 2.389745 | 62416.435424 | 86458.164859 |
| std | 9.425204 | 29937.88406 | 43808.08774 |
| min | 1.0 | 0.0 | 15.0 |
| 25% | 1.0 | 50000.0 | 62215.0 |
| 50% | 1.0 | 60171.0 | 83399.0 |
| 75% | 1.0 | 75504.0 | 109990.0 |
| max | 250.0 | 250000.0 | 265000.0 |


In [21]:
# z-score normalization

sal_col = ['Salary Range From','Salary Range To']

salaries = jobs_df[sal_col]

zscale = StandardScaler()
sal_znorm=pd.DataFrame(zscale.fit_transform(salaries),columns=salaries.columns)
sal_znorm.describe()

| | Salary Range From | Salary Range To |
|---|---|---|
| count | 5519.0 | 5519.0 |
| mean | 1.132955e-16 | -4.377324e-17 |
| std | 1.000091 | 1.000091 |
| min | -2.085054 | -1.973403 |
| 25% | -0.4147775 | -0.5534449 |
| 50% | -0.07500994 | -0.06983738 |
| 75% | 0.4371969 | 0.537206 |
| max | 6.266327 | 4.075914 |


In [22]:
# min-max normalization

mm_scale = MinMaxScaler()
sal_mm = pd.DataFrame(mm_scale.fit_transform(salaries),columns=salaries.columns)
sal_mm.describe()

| | Salary Range From | Salary Range To |
|---|---|---|
| count | 5519.0 | 5519.0 |
| mean | 0.249666 | 0.326219 |
| std | 0.119752 | 0.165323 |
| min | 0.0 | 0.0 |
| 25% | 0.2 | 0.23473 |
| 50% | 0.240684 | 0.314674 |
| 75% | 0.302016 | 0.415023 |
| max | 1.0 | 1.0 |


In [24]:
jobs_df['Job Category'].value_counts()

Engineering, Architecture, & Planning    682
Technology, Data & Innovation            474
Legal Affairs                            444
Building Operations & Maintenance        304
Finance, Accounting, & Procurement       296

It would also be interesting to discretize the job categories, since they are quite specific (some contain typos, and some list multiple categories). However, I was not able to work out how to do that in the time available for this assignment.
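One possible starting point for grouping the job categories is a keyword lookup: lowercase each category and assign it to the first broad group whose keyword appears in it. The keyword table, the group names, and the helper `broad_group` below are all hypothetical choices for illustration, and the sketch does not handle typos or entries that list several categories.

```python
import numpy as np
import pandas as pd

# Hypothetical keyword -> broad group table; the grouping is a judgment call.
keyword_to_group = {
    "engineering": "technical",
    "technology": "technical",
    "legal": "legal",
    "finance": "business",
    "building operations": "operations",
}


def broad_group(category):
    """Map a raw job category to a broad group via the first matching keyword."""
    if pd.isna(category):
        return "unknown"
    text = category.lower()
    for keyword, group in keyword_to_group.items():
        if keyword in text:
            return group
    return "other"


# Category strings taken from the value_counts output, plus a gap and an
# invented unmatched entry to exercise the fallbacks.
toy = pd.Series([
    "Engineering, Architecture, & Planning",
    "Technology, Data & Innovation",
    "Legal Affairs",
    "Building Operations & Maintenance",
    "Finance, Accounting, & Procurement",
    np.nan,
    "Some Unlisted Category",
])

groups = toy.map(broad_group)
print(groups.value_counts())
```

Because the first matching keyword wins, an entry that lists several categories would land in only one group; handling those properly would need a decision about how such entries are delimited in the raw data.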