## Data Wrangling For Capstone 2 Project

### Introduction
We have received the CSV file from the client and now need to evaluate the data for completeness and formatting to make sure it is ready for pre-processing. We will initially check for missing values and confirm that each field is stored in its most usable form.

In [1]:
#Import NumPy and pandas to evaluate our data

import numpy as np
import pandas as pd

In [2]:
#Next we will use pandas to import the dataset into a dataframe
df_raw = pd.read_csv('Placement_Data_Full_Class.csv')
print(df_raw.head())

   sl_no gender  ssc_p    ssc_b  hsc_p    hsc_b     hsc_s  degree_p  \
0      1      M  67.00   Others  91.00   Others  Commerce     58.00   
1      2      M  79.33  Central  78.33   Others   Science     77.48   
2      3      M  65.00  Central  68.00  Central      Arts     64.00   
3      4      M  56.00  Central  52.00  Central   Science     52.00   
4      5      M  85.80  Central  73.60  Central  Commerce     73.30   

    degree_t workex  etest_p specialisation  mba_p      status    salary  
0   Sci&Tech     No     55.0         Mkt&HR  58.80      Placed  270000.0  
1   Sci&Tech    Yes     86.5        Mkt&Fin  66.28      Placed  200000.0  
2  Comm&Mgmt     No     75.0        Mkt&Fin  57.80      Placed  250000.0  
3   Sci&Tech     No     66.0         Mkt&HR  59.43  Not Placed       NaN  
4  Comm&Mgmt     No     96.8        Mkt&Fin  55.50      Placed  425000.0  


It's clear that this dataset will need some preparation and cleaning to make it more usable down the road. In addition, we should check whether there are any missing entries.
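
A quick way to check for missing entries is to count the nulls in each column. The snippet below is a minimal sketch on a small made-up frame (`toy` is a hypothetical stand-in for `df_raw`, and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_raw; the values are made up for illustration.
toy = pd.DataFrame({
    "status": ["Placed", "Placed", "Not Placed", "Placed"],
    "salary": [270000.0, 200000.0, np.nan, 425000.0],
})

# Count missing entries per column; columns with no gaps report 0.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running the same `isna().sum()` call on the real dataframe should give the per-column picture that `info()` reports as non-null counts, just expressed as the number of gaps instead.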

In [3]:
#Let's get a key of all the fields so we can quickly check all the characteristics of our observations
df_keys = df_raw.keys()
print(df_keys)

Index(['sl_no', 'gender', 'ssc_p', 'ssc_b', 'hsc_p', 'hsc_b', 'hsc_s',
       'degree_p', 'degree_t', 'workex', 'etest_p', 'specialisation', 'mba_p',
       'status', 'salary'],
      dtype='object')


Based on the information provided by the client, we can at least consider what each characteristic entails:
* sl_no: Is the serial number for an entry (student going forward), which is effectively an index key
* gender: Provides the gender for a student (M for Male, and F for Female)
* ssc_p: The percentile rank of the student in 10th grade
* ssc_b: The Board of Education associated with the student's 10th grade class (Central or Others)
* hsc_p: The percentile rank of the student in 12th grade
* hsc_b: The Board of Education associated with the student's 12th grade class (Central or Others)
* hsc_s: The specialization of the student in 12th grade (Commerce, Science, or Arts)
* degree_p: The percentile rank of the student at completion of their undergraduate degree
* degree_t: The undergraduate degree that the student acquired, which is one of Communications & Management (Comm&Mgmt), Science & Technology (Sci&Tech), or Others
* workex: Whether the student has work experience or not
* etest_p: The percentile rank of the student on their employability test
* specialisation: The specialization of the student, which is either Marketing & Finance (Mkt&Fin) or Marketing & Human Resources (Mkt&HR)
* mba_p: The percentile ranking of the student once they've completed their MBA
* status: Whether a student has been placed (aka outcome) which are Placed or Not Placed
* salary: The salary of a student upon placement
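
Since the field descriptions are maintained by hand, it can help to confirm that the documented names match the dataframe's columns exactly. The sketch below assumes the field names are those in the CSV header printed earlier; `compare_columns` and `toy` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Field names taken from the CSV header printed earlier in the notebook.
documented = [
    "sl_no", "gender", "ssc_p", "ssc_b", "hsc_p", "hsc_b", "hsc_s",
    "degree_p", "degree_t", "workex", "etest_p", "specialisation",
    "mba_p", "status", "salary",
]

def compare_columns(df, expected):
    """Return (missing, unexpected) column names relative to `expected`."""
    actual = set(df.columns)
    wanted = set(expected)
    return sorted(wanted - actual), sorted(actual - wanted)

# Toy frame with one deliberately misspelled column to show a mismatch.
toy_columns = ["specialization" if c == "specialisation" else c for c in documented]
toy = pd.DataFrame(columns=toy_columns)

missing, unexpected = compare_columns(toy, documented)
print("missing:", missing)
print("unexpected:", unexpected)
```

On the real data, both lists coming back empty would indicate that the documentation and the dataframe agree.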

In [4]:
#First, let's get a baseline for the type of information we are working with
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sl_no           215 non-null    int64  
 1   gender          215 non-null    object 
 2   ssc_p           215 non-null    float64
 3   ssc_b           215 non-null    object 
 4   hsc_p           215 non-null    float64
 5   hsc_b           215 non-null    object 
 6   hsc_s           215 non-null    object 
 7   degree_p        215 non-null    float64
 8   degree_t        215 non-null    object 
 9   workex          215 non-null    object 
 10  etest_p         215 non-null    float64
 11  specialisation  215 non-null    object 
 12  mba_p           215 non-null    float64
 13  status          215 non-null    object 
 14  salary          148 non-null    float64
dtypes: float64(6), int64(1), object(8)
memory usage: 25.3+ KB


It's clear that there are some changes we should consider to the data and some things to investigate:
* Many of the categorical variables (e.g., gender, status, etc.) are of the 'object' type when they would likely be better stored as Booleans or as a simple integer of 0 or 1
* 'salary' has fewer non-null entries than the other characteristics, which suggests we should see if there is a pattern to the missing values

Let's begin by converting the categorical characteristics to a better form.

In [5]:
#The 'gender' characteristics can be converted to simply a 0 for Male and 1 for Female to make it more usable.
#That said, we need to apply a few steps to get there. First, let's investigate the entries
print(df_raw['gender'])
print(df_raw['gender'].dtype)
print(df_raw['gender'].unique())

0      M
1      M
2      M
3      M
4      M
      ..
210    M
211    M
212    M
213    F
214    M
Name: gender, Length: 215, dtype: object
object
['M' 'F']


In [6]:
#Alright, it's clear that we have entries for Male as 'M' and Female as 'F' and no missing values
#Let's make it a column that identifies Male as "0" and Female as "1"
df_raw_gender = np.array(df_raw['gender'] == 'F')
df_raw_gender = df_raw_gender.astype(int)
print(df_raw_gender)

[0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 1 1 0 0 1 1 1 0 1 0 0 0 0 1 1 1 1 0 1 0
 1 1 0 1 1 0 0 1 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0
 0 1 1 0 0 1 1 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 1 1 1 0 1 0 1 0 0 0 0 0 0 0 1
 0 0 1 0 1 0 0 0 0 0 1 1 0 0 1 1 1 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0
 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 0 0 0 0 1
 1 1 0 0 1 1 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0]


In [7]:
##Now we can replace the 'gender' values with 0 and 1, and then rename the 'gender' column to 'female'
df_raw['gender'] = df_raw_gender
df_raw.rename(columns={'gender':'female'}, inplace=True)
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sl_no           215 non-null    int64  
 1   female          215 non-null    int32  
 2   ssc_p           215 non-null    float64
 3   ssc_b           215 non-null    object 
 4   hsc_p           215 non-null    float64
 5   hsc_b           215 non-null    object 
 6   hsc_s           215 non-null    object 
 7   degree_p        215 non-null    float64
 8   degree_t        215 non-null    object 
 9   workex          215 non-null    object 
 10  etest_p         215 non-null    float64
 11  specialisation  215 non-null    object 
 12  mba_p           215 non-null    float64
 13  status          215 non-null    object 
 14  salary          148 non-null    float64
dtypes: float64(6), int32(1), int64(1), object(7)
memory usage: 24.5+ KB


We've converted 'gender' to a 'female' identifier that is "0" for males and "1" for females, which should make it easier to work with in the long run. The next characteristic to check is "ssc_b".

>*As a note, I'll try to move through the remaining conversions of this type quickly for reader convenience. The fields that will be adjusted are ssc_b, hsc_b, status, and workex*
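
Because the remaining conversions all follow the same pattern, they could be wrapped in one small helper. This is only a sketch of that idea, not the approach the notebook uses: `encode_binary` is a hypothetical function, and `toy` is a made-up frame.

```python
import pandas as pd

def encode_binary(df, column, positive_value, new_name):
    """Return a copy of `df` where `column` becomes a 0/1 column `new_name`.

    Rows equal to `positive_value` become 1 and everything else becomes 0.
    Raises if the column holds more than two distinct values, so a third
    category is not silently folded into 0.
    """
    if df[column].nunique(dropna=False) > 2:
        raise ValueError(f"{column!r} is not binary")
    out = df.copy()
    out[column] = (out[column] == positive_value).astype(int)
    return out.rename(columns={column: new_name})

# Toy data using the same labels that appear in the dataset.
toy = pd.DataFrame({
    "ssc_b": ["Others", "Central", "Central"],
    "workex": ["No", "Yes", "No"],
})
toy = encode_binary(toy, "ssc_b", "Central", "ssc_Central")
toy = encode_binary(toy, "workex", "Yes", "work_experience")
print(toy)
```

The helper returns a new frame rather than modifying its input, which makes it safe to re-run a cell without encoding an already encoded column.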

In [8]:
#Investigate the properties and entries of 'ssc_b' characteristic
print(df_raw['ssc_b'])
print(df_raw['ssc_b'].dtype)
print(df_raw['ssc_b'].unique())

0       Others
1      Central
2      Central
3      Central
4      Central
        ...   
210     Others
211     Others
212     Others
213     Others
214    Central
Name: ssc_b, Length: 215, dtype: object
object
['Others' 'Central']


In [9]:
#Convert the series to a 1 for "ssc_Central" and "0" for Other and relabel the column name
df_raw_ssc_b = np.array(df_raw['ssc_b'] == 'Central')
df_raw_ssc_b = df_raw_ssc_b.astype(int)
df_raw['ssc_b'] = df_raw_ssc_b
df_raw.rename(columns={'ssc_b':'ssc_Central'}, inplace=True)
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sl_no           215 non-null    int64  
 1   female          215 non-null    int32  
 2   ssc_p           215 non-null    float64
 3   ssc_Central     215 non-null    int32  
 4   hsc_p           215 non-null    float64
 5   hsc_b           215 non-null    object 
 6   hsc_s           215 non-null    object 
 7   degree_p        215 non-null    float64
 8   degree_t        215 non-null    object 
 9   workex          215 non-null    object 
 10  etest_p         215 non-null    float64
 11  specialisation  215 non-null    object 
 12  mba_p           215 non-null    float64
 13  status          215 non-null    object 
 14  salary          148 non-null    float64
dtypes: float64(6), int32(2), int64(1), object(6)
memory usage: 23.6+ KB


In [10]:
#Investigate the properties and entries of 'hsc_b' characteristic
print(df_raw['hsc_b'])
print(df_raw['hsc_b'].dtype)
print(df_raw['hsc_b'].unique())

0       Others
1       Others
2      Central
3      Central
4      Central
        ...   
210     Others
211     Others
212     Others
213     Others
214     Others
Name: hsc_b, Length: 215, dtype: object
object
['Others' 'Central']


In [11]:
#Convert the series to a 1 for "hsc_Central" and "0" for Other and relabel the column name
df_raw_hsc_b = np.array(df_raw['hsc_b'] == 'Central')
df_raw_hsc_b = df_raw_hsc_b.astype(int)
df_raw['hsc_b'] = df_raw_hsc_b
df_raw.rename(columns={'hsc_b':'hsc_Central'}, inplace=True)
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sl_no           215 non-null    int64  
 1   female          215 non-null    int32  
 2   ssc_p           215 non-null    float64
 3   ssc_Central     215 non-null    int32  
 4   hsc_p           215 non-null    float64
 5   hsc_Central     215 non-null    int32  
 6   hsc_s           215 non-null    object 
 7   degree_p        215 non-null    float64
 8   degree_t        215 non-null    object 
 9   workex          215 non-null    object 
 10  etest_p         215 non-null    float64
 11  specialisation  215 non-null    object 
 12  mba_p           215 non-null    float64
 13  status          215 non-null    object 
 14  salary          148 non-null    float64
dtypes: float64(6), int32(3), int64(1), object(5)
memory usage: 22.8+ KB


In [12]:
#Investigate the properties and entries of 'status' characteristic
print(df_raw['status'])
print(df_raw['status'].dtype)
print(df_raw['status'].unique())

0          Placed
1          Placed
2          Placed
3      Not Placed
4          Placed
          ...    
210        Placed
211        Placed
212        Placed
213        Placed
214    Not Placed
Name: status, Length: 215, dtype: object
object
['Placed' 'Not Placed']


In [13]:
#Convert the series to a 1 for "Placed" and "0" for Not Placed and relabel the column name to "employed"
df_raw_status = np.array(df_raw['status'] == 'Placed')
df_raw_status = df_raw_status.astype(int)
df_raw['status'] = df_raw_status
df_raw.rename(columns={'status':'employed'}, inplace=True)
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sl_no           215 non-null    int64  
 1   female          215 non-null    int32  
 2   ssc_p           215 non-null    float64
 3   ssc_Central     215 non-null    int32  
 4   hsc_p           215 non-null    float64
 5   hsc_Central     215 non-null    int32  
 6   hsc_s           215 non-null    object 
 7   degree_p        215 non-null    float64
 8   degree_t        215 non-null    object 
 9   workex          215 non-null    object 
 10  etest_p         215 non-null    float64
 11  specialisation  215 non-null    object 
 12  mba_p           215 non-null    float64
 13  employed        215 non-null    int32  
 14  salary          148 non-null    float64
dtypes: float64(6), int32(4), int64(1), object(4)
memory usage: 22.0+ KB


In [14]:
#Investigate the properties and entries of 'workex' characteristic
print(df_raw['workex'])
print(df_raw['workex'].dtype)
print(df_raw['workex'].unique())

0       No
1      Yes
2       No
3       No
4       No
      ... 
210     No
211     No
212    Yes
213     No
214     No
Name: workex, Length: 215, dtype: object
object
['No' 'Yes']


In [15]:
#Convert the series to a 1 for "Experience" and "0" for No Experience and relabel the column name to "work_experience"
df_raw_workex = np.array(df_raw['workex'] == 'Yes')
df_raw_workex = df_raw_workex.astype(int)
df_raw['workex'] = df_raw_workex
df_raw.rename(columns={'workex':'work_experience'}, inplace=True)
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   sl_no            215 non-null    int64  
 1   female           215 non-null    int32  
 2   ssc_p            215 non-null    float64
 3   ssc_Central      215 non-null    int32  
 4   hsc_p            215 non-null    float64
 5   hsc_Central      215 non-null    int32  
 6   hsc_s            215 non-null    object 
 7   degree_p         215 non-null    float64
 8   degree_t         215 non-null    object 
 9   work_experience  215 non-null    int32  
 10  etest_p          215 non-null    float64
 11  specialisation   215 non-null    object 
 12  mba_p            215 non-null    float64
 13  employed         215 non-null    int32  
 14  salary           148 non-null    float64
dtypes: float64(6), int32(5), int64(1), object(3)
memory usage: 21.1+ KB


In [16]:
df_raw.head(10)

Unnamed: 0,sl_no,female,ssc_p,ssc_Central,hsc_p,hsc_Central,hsc_s,degree_p,degree_t,work_experience,etest_p,specialisation,mba_p,employed,salary
0,1,0,67.0,0,91.0,0,Commerce,58.0,Sci&Tech,0,55.0,Mkt&HR,58.8,1,270000.0
1,2,0,79.33,1,78.33,0,Science,77.48,Sci&Tech,1,86.5,Mkt&Fin,66.28,1,200000.0
2,3,0,65.0,1,68.0,1,Arts,64.0,Comm&Mgmt,0,75.0,Mkt&Fin,57.8,1,250000.0
3,4,0,56.0,1,52.0,1,Science,52.0,Sci&Tech,0,66.0,Mkt&HR,59.43,0,
4,5,0,85.8,1,73.6,1,Commerce,73.3,Comm&Mgmt,0,96.8,Mkt&Fin,55.5,1,425000.0
5,6,0,55.0,0,49.8,0,Science,67.25,Sci&Tech,1,55.0,Mkt&Fin,51.58,0,
6,7,1,46.0,0,49.2,0,Commerce,79.0,Comm&Mgmt,0,74.28,Mkt&Fin,53.29,0,
7,8,0,82.0,1,64.0,1,Science,66.0,Sci&Tech,1,67.0,Mkt&Fin,62.14,1,252000.0
8,9,0,73.0,1,79.0,1,Commerce,72.0,Comm&Mgmt,0,91.34,Mkt&Fin,61.29,1,231000.0
9,10,0,58.0,1,70.0,1,Commerce,61.0,Comm&Mgmt,0,54.0,Mkt&Fin,52.21,0,


We'll now investigate the 'hsc_s' characteristic to see which values it can take.

In [17]:
#Investigate the properties and entries of 'hsc_s' characteristic
print(df_raw['hsc_s'])
print(df_raw['hsc_s'].dtype)
print(df_raw['hsc_s'].unique())

0      Commerce
1       Science
2          Arts
3       Science
4      Commerce
         ...   
210    Commerce
211     Science
212    Commerce
213    Commerce
214     Science
Name: hsc_s, Length: 215, dtype: object
object
['Commerce' 'Science' 'Arts']


We see here a categorical variable that can take on 3 different values: Commerce, Science, & Arts. We can separate this into 2 dummy variables (n-1 categories). Let's get a count of the different categories.

In [18]:
print(df_raw['hsc_s'].value_counts())

Commerce    113
Science      91
Arts         11
Name: hsc_s, dtype: int64


Since Arts is the "outlier" classification, we can make it the category that is not part of the dummy variables. That said, if we later exclude the Arts observations due to a lack of substantial data, we'll have to be careful to drop one of the other categories in order to avoid a collinearity problem.
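
For comparison, pandas can build the same kind of indicator columns with `pd.get_dummies`. The sketch below uses a made-up frame (`toy`) and drops 'Arts' as the reference category. Note that the resulting column names keep the original capitalization (`hsc_s_Commerce`), unlike the lowercase names used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"hsc_s": ["Commerce", "Science", "Arts", "Science"]})

# One indicator column per category, then drop the reference category
# ('Arts' here) so the remaining columns are not perfectly collinear.
dummies = pd.get_dummies(toy["hsc_s"], prefix="hsc_s", dtype=int)
dummies = dummies.drop(columns="hsc_s_Arts")
print(dummies)
```

A row with 0 in both remaining columns is an 'Arts' student, which is why only n-1 indicators are needed for n categories.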

In [19]:
#Create 'hsc_s_commerce' and 'hsc_s_science' classifications for our data
#Commerce Classification
df_raw_hsc_s_commerce = np.array(df_raw['hsc_s'] == 'Commerce')
df_raw_hsc_s_commerce = df_raw_hsc_s_commerce.astype(int)
#Science Classification
df_raw_hsc_s_science = np.array(df_raw['hsc_s'] == 'Science')
df_raw_hsc_s_science = df_raw_hsc_s_science.astype(int)

In [20]:
#Next, we add these columns to our dataframe
df_raw.insert(7,'hsc_s_commerce',df_raw_hsc_s_commerce)
df_raw.insert(8,'hsc_s_science',df_raw_hsc_s_science)
df_raw.head(10)

Unnamed: 0,sl_no,female,ssc_p,ssc_Central,hsc_p,hsc_Central,hsc_s,hsc_s_commerce,hsc_s_science,degree_p,degree_t,work_experience,etest_p,specialisation,mba_p,employed,salary
0,1,0,67.0,0,91.0,0,Commerce,1,0,58.0,Sci&Tech,0,55.0,Mkt&HR,58.8,1,270000.0
1,2,0,79.33,1,78.33,0,Science,0,1,77.48,Sci&Tech,1,86.5,Mkt&Fin,66.28,1,200000.0
2,3,0,65.0,1,68.0,1,Arts,0,0,64.0,Comm&Mgmt,0,75.0,Mkt&Fin,57.8,1,250000.0
3,4,0,56.0,1,52.0,1,Science,0,1,52.0,Sci&Tech,0,66.0,Mkt&HR,59.43,0,
4,5,0,85.8,1,73.6,1,Commerce,1,0,73.3,Comm&Mgmt,0,96.8,Mkt&Fin,55.5,1,425000.0
5,6,0,55.0,0,49.8,0,Science,0,1,67.25,Sci&Tech,1,55.0,Mkt&Fin,51.58,0,
6,7,1,46.0,0,49.2,0,Commerce,1,0,79.0,Comm&Mgmt,0,74.28,Mkt&Fin,53.29,0,
7,8,0,82.0,1,64.0,1,Science,0,1,66.0,Sci&Tech,1,67.0,Mkt&Fin,62.14,1,252000.0
8,9,0,73.0,1,79.0,1,Commerce,1,0,72.0,Comm&Mgmt,0,91.34,Mkt&Fin,61.29,1,231000.0
9,10,0,58.0,1,70.0,1,Commerce,1,0,61.0,Comm&Mgmt,0,54.0,Mkt&Fin,52.21,0,


We'll keep the 'hsc_s' column for now as we are simply in the data munging & data wrangling process. If it proves to be more efficient to cut this column, we can remove it. For now, we can simply treat it as a 'key' related to the columns we have inserted

In [21]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   sl_no            215 non-null    int64  
 1   female           215 non-null    int32  
 2   ssc_p            215 non-null    float64
 3   ssc_Central      215 non-null    int32  
 4   hsc_p            215 non-null    float64
 5   hsc_Central      215 non-null    int32  
 6   hsc_s            215 non-null    object 
 7   hsc_s_commerce   215 non-null    int32  
 8   hsc_s_science    215 non-null    int32  
 9   degree_p         215 non-null    float64
 10  degree_t         215 non-null    object 
 11  work_experience  215 non-null    int32  
 12  etest_p          215 non-null    float64
 13  specialisation   215 non-null    object 
 14  mba_p            215 non-null    float64
 15  employed         215 non-null    int32  
 16  salary           148 non-null    float64
dtypes: float64(6), i

It would appear there are two more categories that we may need to explore a little to see if they need to be re-formatted: 'degree_t' and 'specialisation'.

> *For reader convenience, I'll do these without much explication as they are similar to what we did above*

In [22]:
#Investigate the properties and entries of 'degree_t' characteristic
print(df_raw['degree_t'])
print(df_raw['degree_t'].dtype)
print(df_raw['degree_t'].unique())

0       Sci&Tech
1       Sci&Tech
2      Comm&Mgmt
3       Sci&Tech
4      Comm&Mgmt
         ...    
210    Comm&Mgmt
211     Sci&Tech
212    Comm&Mgmt
213    Comm&Mgmt
214    Comm&Mgmt
Name: degree_t, Length: 215, dtype: object
object
['Sci&Tech' 'Comm&Mgmt' 'Others']


In [23]:
print(df_raw['degree_t'].value_counts())

Comm&Mgmt    145
Sci&Tech      59
Others        11
Name: degree_t, dtype: int64


In [24]:
#Create 'degree_t_comm_mgmt' and 'degree_t_sci_tech' classifications for our data
#Comm&Mgmt Classification
df_raw_degree_t_comm_mgmt = np.array(df_raw['degree_t'] == 'Comm&Mgmt')
df_raw_degree_t_comm_mgmt = df_raw_degree_t_comm_mgmt.astype(int)
#Sci&Tech Classification
df_raw_degree_t_sci_tech = np.array(df_raw['degree_t'] == 'Sci&Tech')
df_raw_degree_t_sci_tech = df_raw_degree_t_sci_tech.astype(int)

In [25]:
#Next, we add these columns to our dataframe
df_raw.insert(11,'degree_t_comm_mgmt',df_raw_degree_t_comm_mgmt)
df_raw.insert(12,'degree_t_sci_tech',df_raw_degree_t_sci_tech)
df_raw.head(10)

Unnamed: 0,sl_no,female,ssc_p,ssc_Central,hsc_p,hsc_Central,hsc_s,hsc_s_commerce,hsc_s_science,degree_p,degree_t,degree_t_comm_mgmt,degree_t_sci_tech,work_experience,etest_p,specialisation,mba_p,employed,salary
0,1,0,67.0,0,91.0,0,Commerce,1,0,58.0,Sci&Tech,0,1,0,55.0,Mkt&HR,58.8,1,270000.0
1,2,0,79.33,1,78.33,0,Science,0,1,77.48,Sci&Tech,0,1,1,86.5,Mkt&Fin,66.28,1,200000.0
2,3,0,65.0,1,68.0,1,Arts,0,0,64.0,Comm&Mgmt,1,0,0,75.0,Mkt&Fin,57.8,1,250000.0
3,4,0,56.0,1,52.0,1,Science,0,1,52.0,Sci&Tech,0,1,0,66.0,Mkt&HR,59.43,0,
4,5,0,85.8,1,73.6,1,Commerce,1,0,73.3,Comm&Mgmt,1,0,0,96.8,Mkt&Fin,55.5,1,425000.0
5,6,0,55.0,0,49.8,0,Science,0,1,67.25,Sci&Tech,0,1,1,55.0,Mkt&Fin,51.58,0,
6,7,1,46.0,0,49.2,0,Commerce,1,0,79.0,Comm&Mgmt,1,0,0,74.28,Mkt&Fin,53.29,0,
7,8,0,82.0,1,64.0,1,Science,0,1,66.0,Sci&Tech,0,1,1,67.0,Mkt&Fin,62.14,1,252000.0
8,9,0,73.0,1,79.0,1,Commerce,1,0,72.0,Comm&Mgmt,1,0,0,91.34,Mkt&Fin,61.29,1,231000.0
9,10,0,58.0,1,70.0,1,Commerce,1,0,61.0,Comm&Mgmt,1,0,0,54.0,Mkt&Fin,52.21,0,


In [26]:
#Investigate the properties and entries of 'specialisation' characteristic
print(df_raw['specialisation'])
print(df_raw['specialisation'].dtype)
print(df_raw['specialisation'].unique())

0       Mkt&HR
1      Mkt&Fin
2      Mkt&Fin
3       Mkt&HR
4      Mkt&Fin
        ...   
210    Mkt&Fin
211    Mkt&Fin
212    Mkt&Fin
213     Mkt&HR
214     Mkt&HR
Name: specialisation, Length: 215, dtype: object
object
['Mkt&HR' 'Mkt&Fin']


Interestingly, all the specializations boil down to either HR or Finance, as each one apparently also includes Marketing. As such, we can simply create two columns, one for HR and one for Finance (with the understanding that we'd only include one in any analysis to avoid collinearity issues).

In [27]:
print(df_raw['specialisation'].value_counts())

Mkt&Fin    120
Mkt&HR      95
Name: specialisation, dtype: int64


In [28]:
#Create 'specialisation_finance' and 'specialisation_hr' classifications for our data
#Finance Classification
df_raw_specialisation_finance = np.array(df_raw['specialisation'] == 'Mkt&Fin')
df_raw_specialisation_finance = df_raw_specialisation_finance.astype(int)
#HR Classification
df_raw_specialisation_hr = np.array(df_raw['specialisation'] == 'Mkt&HR')
df_raw_specialisation_hr = df_raw_specialisation_hr.astype(int)

In [29]:
#Next, we add these columns to our dataframe
df_raw.insert(16,'specialisation_finance',df_raw_specialisation_finance)
df_raw.insert(17,'specialisation_hr',df_raw_specialisation_hr)
df_raw.head(10)

Unnamed: 0,sl_no,female,ssc_p,ssc_Central,hsc_p,hsc_Central,hsc_s,hsc_s_commerce,hsc_s_science,degree_p,...,degree_t_comm_mgmt,degree_t_sci_tech,work_experience,etest_p,specialisation,specialisation_finance,specialisation_hr,mba_p,employed,salary
0,1,0,67.0,0,91.0,0,Commerce,1,0,58.0,...,0,1,0,55.0,Mkt&HR,0,1,58.8,1,270000.0
1,2,0,79.33,1,78.33,0,Science,0,1,77.48,...,0,1,1,86.5,Mkt&Fin,1,0,66.28,1,200000.0
2,3,0,65.0,1,68.0,1,Arts,0,0,64.0,...,1,0,0,75.0,Mkt&Fin,1,0,57.8,1,250000.0
3,4,0,56.0,1,52.0,1,Science,0,1,52.0,...,0,1,0,66.0,Mkt&HR,0,1,59.43,0,
4,5,0,85.8,1,73.6,1,Commerce,1,0,73.3,...,1,0,0,96.8,Mkt&Fin,1,0,55.5,1,425000.0
5,6,0,55.0,0,49.8,0,Science,0,1,67.25,...,0,1,1,55.0,Mkt&Fin,1,0,51.58,0,
6,7,1,46.0,0,49.2,0,Commerce,1,0,79.0,...,1,0,0,74.28,Mkt&Fin,1,0,53.29,0,
7,8,0,82.0,1,64.0,1,Science,0,1,66.0,...,0,1,1,67.0,Mkt&Fin,1,0,62.14,1,252000.0
8,9,0,73.0,1,79.0,1,Commerce,1,0,72.0,...,1,0,0,91.34,Mkt&Fin,1,0,61.29,1,231000.0
9,10,0,58.0,1,70.0,1,Commerce,1,0,61.0,...,1,0,0,54.0,Mkt&Fin,1,0,52.21,0,


We now have to take a look at employment and salary to verify the pattern we can potentially see. Specifically, it appears that if 'employed' is 1 (meaning the student found employment), then a salary is also provided. However, if 'employed' is 0 (meaning the student has not found employment), then the salary appears as 'NaN'. We can check this pattern to make sure there are no errors.

In [30]:
#Check to see if there is any break in the pattern of employed = 1 and salary *or* employed = 0 and salary = NaN
print(df_raw['employed'].value_counts())
df_raw[['employed','salary']].groupby('employed').count()

1    148
0     67
Name: employed, dtype: int64


employed,salary
0,0
1,148


Based on the output, the pattern holds: none of the 67 students who are not employed has a salary recorded, and all 148 students who were hired have one. We can therefore safely assume that a missing salary means the student is not employed, and we don't need to worry about students who are employed but did not provide their salary.
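
The same check can also be written row by row, which would pinpoint any offending observations if the pattern ever broke. This is a minimal sketch on made-up data (`toy` is hypothetical). It assumes the check runs before any nulls are replaced.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "employed": [1, 1, 0, 1, 0],
    "salary": [270000.0, 200000.0, np.nan, 425000.0, np.nan],
})

# A row is consistent when salary is missing exactly when employed == 0.
consistent = toy["salary"].isna() == (toy["employed"] == 0)

# Any rows that break the pattern would show up here.
violations = toy[~consistent]
print(len(violations))
```

An empty `violations` frame corresponds to the grouped counts matching the `value_counts()` totals.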

Based on feedback from our mentor, we should consider a few more things for our data. Primarily, we should convert the null values in the salary column to zeroes for easier processing and run some analyses to determine whether we have outliers. We'll first change the nulls into zeroes for the "salary" characteristic.

In [31]:
#Let's fill in the NaNs in the 'salary' characteristic with zeroes
df_raw_salary = df_raw['salary']
df_raw_salary = df_raw_salary.fillna(0)

In [32]:
#Now let's insert our new column to make our salary information clean
df_raw['salary'] = df_raw_salary
df_raw

Unnamed: 0,sl_no,female,ssc_p,ssc_Central,hsc_p,hsc_Central,hsc_s,hsc_s_commerce,hsc_s_science,degree_p,...,degree_t_comm_mgmt,degree_t_sci_tech,work_experience,etest_p,specialisation,specialisation_finance,specialisation_hr,mba_p,employed,salary
0,1,0,67.00,0,91.00,0,Commerce,1,0,58.00,...,0,1,0,55.0,Mkt&HR,0,1,58.80,1,270000.0
1,2,0,79.33,1,78.33,0,Science,0,1,77.48,...,0,1,1,86.5,Mkt&Fin,1,0,66.28,1,200000.0
2,3,0,65.00,1,68.00,1,Arts,0,0,64.00,...,1,0,0,75.0,Mkt&Fin,1,0,57.80,1,250000.0
3,4,0,56.00,1,52.00,1,Science,0,1,52.00,...,0,1,0,66.0,Mkt&HR,0,1,59.43,0,0.0
4,5,0,85.80,1,73.60,1,Commerce,1,0,73.30,...,1,0,0,96.8,Mkt&Fin,1,0,55.50,1,425000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
210,211,0,80.60,0,82.00,0,Commerce,1,0,77.60,...,1,0,0,91.0,Mkt&Fin,1,0,74.49,1,400000.0
211,212,0,58.00,0,60.00,0,Science,0,1,72.00,...,0,1,0,74.0,Mkt&Fin,1,0,53.62,1,275000.0
212,213,0,67.00,0,67.00,0,Commerce,1,0,73.00,...,1,0,1,59.0,Mkt&Fin,1,0,69.72,1,295000.0
213,214,1,74.00,0,66.00,0,Commerce,1,0,58.00,...,1,0,0,70.0,Mkt&HR,0,1,60.23,1,204000.0


Now we'll consider whether we have outliers in our data. We'll only consider the variables that have a numeric range associated with them, as the check would not be appropriate for categorical variables. These would be:
* ssc_p
* hsc_p
* degree_p
* etest_p
* mba_p
* salary

To do this, we'll first have to define what an outlier will be for us. We will use the Interquartile Range (IQR) method, where we first take IQR * 1.5 as a constant (K). We will then identify any value less than the 25th percentile - K or greater than the 75th percentile + K as an outlier.
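
The IQR rule can be expressed as a small reusable function: a value is flagged when it falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, matching the comparisons used in the cells that follow. The sketch below is illustrative only. `iqr_outlier_mask` is a hypothetical helper, and the sample values are made up.

```python
import pandas as pd

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask that is True where a value lies outside the IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    margin = k * (q3 - q1)  # this is the constant K described above
    return (values < q1 - margin) | (values > q3 + margin)

# Made-up scores: Q1 = 61.5, Q3 = 70.5, IQR = 9, so the fences are 48 and 84.
toy = pd.Series([50.0, 60.0, 62.0, 65.0, 66.0, 70.0, 72.0, 99.0])
mask = iqr_outlier_mask(toy)
print(toy[mask])
```

Here only 99.0 is flagged. The value 50.0 sits above the lower fence of 48, so it is kept even though it is the minimum.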

In [33]:
#First, let's see the values we need to define our IQRs to get a sense of the data
df_raw.describe()

Unnamed: 0,sl_no,female,ssc_p,ssc_Central,hsc_p,hsc_Central,hsc_s_commerce,hsc_s_science,degree_p,degree_t_comm_mgmt,degree_t_sci_tech,work_experience,etest_p,specialisation_finance,specialisation_hr,mba_p,employed,salary
count,215.0,215.0,215.0,215.0,215.0,215.0,215.0,215.0,215.0,215.0,215.0,215.0,215.0,215.0,215.0,215.0,215.0,215.0
mean,108.0,0.353488,67.303395,0.539535,66.333163,0.390698,0.525581,0.423256,66.370186,0.674419,0.274419,0.344186,72.100558,0.55814,0.44186,62.278186,0.688372,198702.325581
std,62.209324,0.479168,10.827205,0.499598,10.897509,0.489045,0.50051,0.495228,7.358743,0.469685,0.447262,0.476211,13.275956,0.497767,0.497767,5.833385,0.46424,154780.926716
min,1.0,0.0,40.89,0.0,37.0,0.0,0.0,0.0,50.0,0.0,0.0,0.0,50.0,0.0,0.0,51.21,0.0,0.0
25%,54.5,0.0,60.6,0.0,60.9,0.0,0.0,0.0,61.0,0.0,0.0,0.0,60.0,0.0,0.0,57.945,0.0,0.0
50%,108.0,0.0,67.0,1.0,65.0,0.0,1.0,0.0,66.0,1.0,0.0,0.0,71.0,1.0,0.0,62.0,1.0,240000.0
75%,161.5,1.0,75.7,1.0,73.0,1.0,1.0,1.0,72.0,1.0,1.0,1.0,83.5,1.0,1.0,66.255,1.0,282500.0
max,215.0,1.0,89.4,1.0,97.7,1.0,1.0,1.0,91.0,1.0,1.0,1.0,98.0,1.0,1.0,77.89,1.0,940000.0


In [34]:
#Let's work through the ssc_p portion step by step to get the numbers
ssc_p_Q3 = np.percentile(df_raw['ssc_p'],75)
ssc_p_Q1 = np.percentile(df_raw['ssc_p'],25)
ssc_p_IQR = ssc_p_Q3 - ssc_p_Q1
ssc_p_IQR_K = ssc_p_IQR * 1.5

#Next, we want to see if there are any observations in the ssc_p that would be considered outliers
df_raw_ssc_p_outliers = df_raw.loc[(df_raw['ssc_p'] < (ssc_p_Q1 - ssc_p_IQR_K)) | (df_raw['ssc_p'] > (ssc_p_Q3 + ssc_p_IQR_K))]
df_raw_ssc_p_outliers

Unnamed: 0,sl_no,female,ssc_p,ssc_Central,hsc_p,hsc_Central,hsc_s,hsc_s_commerce,hsc_s_science,degree_p,...,degree_t_comm_mgmt,degree_t_sci_tech,work_experience,etest_p,specialisation,specialisation_finance,specialisation_hr,mba_p,employed,salary


We can see from the IQR defined outlier check that the "ssc_p" category has no outliers to consider (yay!).

>*As a note, I'm going to work through the others quickly*

In [35]:
#Let's work through the hsc_p portion step by step to get the numbers
hsc_p_Q3 = np.percentile(df_raw['hsc_p'],75)
hsc_p_Q1 = np.percentile(df_raw['hsc_p'],25)
hsc_p_IQR = hsc_p_Q3 - hsc_p_Q1
hsc_p_IQR_K = hsc_p_IQR * 1.5

#Next, we want to see if there are any observations in the hsc_p that would be considered outliers
df_raw_hsc_p_outliers = df_raw.loc[(df_raw['hsc_p'] < (hsc_p_Q1 - hsc_p_IQR_K)) | (df_raw['hsc_p'] > (hsc_p_Q3 + hsc_p_IQR_K))]
df_raw_hsc_p_outliers

Unnamed: 0,sl_no,female,ssc_p,ssc_Central,hsc_p,hsc_Central,hsc_s,hsc_s_commerce,hsc_s_science,degree_p,...,degree_t_comm_mgmt,degree_t_sci_tech,work_experience,etest_p,specialisation,specialisation_finance,specialisation_hr,mba_p,employed,salary
24,25,0,76.5,0,97.7,0,Science,0,1,78.86,...,0,1,0,97.4,Mkt&Fin,1,0,74.01,1,360000.0
42,43,0,49.0,0,39.0,1,Science,0,1,65.0,...,0,0,0,63.0,Mkt&Fin,1,0,51.21,0,0.0
49,50,1,50.0,0,37.0,0,Arts,0,0,52.0,...,0,0,0,65.0,Mkt&HR,0,1,56.11,0,0.0
120,121,0,58.0,0,40.0,0,Science,0,1,59.0,...,1,0,0,73.0,Mkt&HR,0,1,58.81,0,0.0
134,135,1,77.44,1,92.0,0,Commerce,1,0,72.0,...,1,0,1,94.0,Mkt&Fin,1,0,67.13,1,250000.0
169,170,0,59.96,0,42.16,0,Science,0,1,61.26,...,0,1,0,54.48,Mkt&HR,0,1,65.48,0,0.0
177,178,1,73.0,1,97.0,0,Commerce,1,0,79.0,...,1,0,1,89.0,Mkt&Fin,1,0,70.81,1,650000.0
206,207,0,41.0,1,42.0,1,Science,0,1,60.0,...,1,0,0,97.0,Mkt&Fin,1,0,53.39,0,0.0


For the hsc_p characteristic, we can see that 8 observations meet the criteria for being an outlier. As such, we should label these observations as "hsc_p" outliers for clarity and potential future filtering.

In [36]:
##Create a series to insert into the data set that identifies if a "hsc_p" observation is technically an outlier
df_raw_hsc_p_outliers_insert = df_raw["hsc_p"]
df_raw_hsc_p_outliers_insert = (df_raw_hsc_p_outliers_insert < (hsc_p_Q1 - hsc_p_IQR_K)) | (df_raw_hsc_p_outliers_insert > (hsc_p_Q3 + hsc_p_IQR_K))
df_raw_hsc_p_outliers_insert = df_raw_hsc_p_outliers_insert.astype(int)

#Insert into data set
df_raw.insert(5,'hsc_p_outlier',df_raw_hsc_p_outliers_insert)
df_raw.head()

Unnamed: 0,sl_no,female,ssc_p,ssc_Central,hsc_p,hsc_p_outlier,hsc_Central,hsc_s,hsc_s_commerce,hsc_s_science,...,degree_t_comm_mgmt,degree_t_sci_tech,work_experience,etest_p,specialisation,specialisation_finance,specialisation_hr,mba_p,employed,salary
0,1,0,67.0,0,91.0,0,0,Commerce,1,0,...,0,1,0,55.0,Mkt&HR,0,1,58.8,1,270000.0
1,2,0,79.33,1,78.33,0,0,Science,0,1,...,0,1,1,86.5,Mkt&Fin,1,0,66.28,1,200000.0
2,3,0,65.0,1,68.0,0,1,Arts,0,0,...,1,0,0,75.0,Mkt&Fin,1,0,57.8,1,250000.0
3,4,0,56.0,1,52.0,0,1,Science,0,1,...,0,1,0,66.0,Mkt&HR,0,1,59.43,0,0.0
4,5,0,85.8,1,73.6,0,1,Commerce,1,0,...,1,0,0,96.8,Mkt&Fin,1,0,55.5,1,425000.0


In [37]:
#Let's work through the degree_p portion step by step to get the numbers
degree_p_Q3 = np.percentile(df_raw['degree_p'],75)
degree_p_Q1 = np.percentile(df_raw['degree_p'],25)
degree_p_IQR = degree_p_Q3 - degree_p_Q1
degree_p_IQR_K = degree_p_IQR * 1.5

#Next, we want to see if there are any observations in the degree_p that would be considered outliers
df_raw_degree_p_outliers = df_raw.loc[(df_raw['degree_p'] < (degree_p_Q1 - degree_p_IQR_K)) | (df_raw['degree_p'] > (degree_p_Q3 + degree_p_IQR_K))]
df_raw_degree_p_outliers

Unnamed: 0,sl_no,female,ssc_p,ssc_Central,hsc_p,hsc_p_outlier,hsc_Central,hsc_s,hsc_s_commerce,hsc_s_science,...,degree_t_comm_mgmt,degree_t_sci_tech,work_experience,etest_p,specialisation,specialisation_finance,specialisation_hr,mba_p,employed,salary
197,198,1,83.96,0,53.0,0,0,Science,0,1,...,0,1,0,59.32,Mkt&HR,0,1,69.71,1,260000.0


It would appear that the "degree_p" characteristic has a single outlier, so we'll once again create a flag for it.
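
The flag columns in this notebook are inserted at hard-coded positions (5 and 11), which silently shift if columns are added or removed earlier on. One possible alternative, sketched here under assumptions, is to look up the position of the source column and insert the flag right after it. The helper name `insert_outlier_flag` and the toy DataFrame are hypothetical and only serve the illustration.

```python
import pandas as pd


def insert_outlier_flag(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Insert '<column>_outlier' (0/1) immediately after `column`, modifying df in place."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    fence = k * (q3 - q1)
    flag = ((df[column] < q1 - fence) | (df[column] > q3 + fence)).astype(int)
    # Look up where the source column sits instead of hard-coding the position
    position = df.columns.get_loc(column) + 1
    df.insert(position, f"{column}_outlier", flag)
    return df


# Toy data (hypothetical values, not taken from the placement data set)
toy_df = pd.DataFrame({
    "sl_no": [1, 2, 3, 4, 5, 6],
    "degree_p": [64.0, 66.0, 65.0, 67.0, 91.0, 66.5],
    "etest_p": [55.0, 60.0, 65.0, 70.0, 75.0, 80.0],
})
insert_outlier_flag(toy_df, "degree_p")
print(list(toy_df.columns))
```

This sketch assumes the column names are unique, so that `get_loc` returns a single integer position.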

In [38]:
##Create a series to insert into the data set that identifies whether a "degree_p" observation is technically an outlier
df_raw_degree_p_outliers_insert = df_raw["degree_p"]
df_raw_degree_p_outliers_insert = (df_raw_degree_p_outliers_insert < (degree_p_Q1 - degree_p_IQR_K)) | (df_raw_degree_p_outliers_insert > (degree_p_Q3 + degree_p_IQR_K))
df_raw_degree_p_outliers_insert = df_raw_degree_p_outliers_insert.astype(int)

#Insert into data set
df_raw.insert(11,'degree_p_outlier',df_raw_degree_p_outliers_insert)

Next, we will check the etest_p column to see if there are any outliers.

In [39]:
#Let's work through the etest_p portion step by step to get the numbers
etest_p_Q3 = np.percentile(df_raw['etest_p'],75)
etest_p_Q1 = np.percentile(df_raw['etest_p'],25)
etest_p_IQR = etest_p_Q3 - etest_p_Q1
etest_p_IQR_K = etest_p_IQR * 1.5

#Next, we want to see if there are any observations in the etest_p that would be considered outliers
df_raw_etest_p_outliers = df_raw.loc[(df_raw['etest_p'] < (etest_p_Q1 - etest_p_IQR_K)) | (df_raw['etest_p'] > (etest_p_Q3 + etest_p_IQR_K))]
df_raw_etest_p_outliers

Unnamed: 0,sl_no,female,ssc_p,ssc_Central,hsc_p,hsc_p_outlier,hsc_Central,hsc_s,hsc_s_commerce,hsc_s_science,...,degree_t_comm_mgmt,degree_t_sci_tech,work_experience,etest_p,specialisation,specialisation_finance,specialisation_hr,mba_p,employed,salary


The etest_p category appears to have no outliers! Next, we'll check the mba_p category.

In [40]:
#Let's work through the mba_p portion step by step to get the numbers
mba_p_Q3 = np.percentile(df_raw['mba_p'],75)
mba_p_Q1 = np.percentile(df_raw['mba_p'],25)
mba_p_IQR = mba_p_Q3 - mba_p_Q1
mba_p_IQR_K = mba_p_IQR * 1.5

#Next, we want to see if there are any observations in the mba_p that would be considered outliers
df_raw_mba_p_outliers = df_raw.loc[(df_raw['mba_p'] < (mba_p_Q1 - mba_p_IQR_K)) | (df_raw['mba_p'] > (mba_p_Q3 + mba_p_IQR_K))]
df_raw_mba_p_outliers

Unnamed: 0,sl_no,female,ssc_p,ssc_Central,hsc_p,hsc_p_outlier,hsc_Central,hsc_s,hsc_s_commerce,hsc_s_science,...,degree_t_comm_mgmt,degree_t_sci_tech,work_experience,etest_p,specialisation,specialisation_finance,specialisation_hr,mba_p,employed,salary


The "mba_p" category has no outliers, which is great as well! The final category to check is the "salary" category.
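
As a cross-check on the column-by-column results, the outlier counts for several score columns can be computed in one pass by asking pandas for both quartiles of every column at once. This is only a sketch on made-up numbers: the toy DataFrame and the names `score_columns` and `outlier_counts` are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the placement data set)
toy_df = pd.DataFrame({
    "etest_p": [55.0, 60.0, 65.0, 70.0, 75.0, 80.0],
    "mba_p": [58.8, 64.0, 57.8, 59.43, 55.5, 62.0],
    "degree_p": [64.0, 66.0, 65.0, 67.0, 91.0, 66.5],
})
score_columns = ["etest_p", "mba_p", "degree_p"]

# One call returns Q1 and Q3 for every column (rows are the quantile levels)
quartiles = toy_df[score_columns].quantile([0.25, 0.75])
q1, q3 = quartiles.loc[0.25], quartiles.loc[0.75]
fence = 1.5 * (q3 - q1)

# Comparisons against a Series align on column names, so each column uses its own fences
outside = toy_df[score_columns].lt(q1 - fence) | toy_df[score_columns].gt(q3 + fence)
outlier_counts = outside.sum()
print(outlier_counts)
```

A count of zero for a column corresponds to the empty tables shown for etest_p and mba_p above.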

The "salary" category is a little different because it contains a lot of zeroes, and those zeroes are tied to a specific situation (where a person didn't get a job). As such, we will run the IQR outlier test using only the observations that have a salary above zero.

In [41]:
#First, let's make a DataFrame that contains only the rows with salaries above zero
salary_gt0 = df_raw.loc[df_raw['salary'] > 0,:]
salary_gt0['salary'].unique()

array([270000., 200000., 250000., 425000., 252000., 231000., 260000.,
       218000., 300000., 236000., 265000., 393000., 360000., 240000.,
       350000., 278000., 320000., 411000., 287000., 204000., 450000.,
       216000., 220000., 268000., 275000., 336000., 230000., 500000.,
       400000., 210000., 420000., 380000., 280000., 276000., 940000.,
       225000., 233000., 690000., 340000., 255000., 285000., 290000.,
       650000., 264000., 295000.])

In [42]:
#Based on the 'unique' check, we can be confident that only salaries greater than zero are in the subset. From here, we will proceed with the IQR test
#Let's work through the salary portion step by step to get the numbers
salary_gt0_Q3 = np.percentile(salary_gt0['salary'],75)
salary_gt0_Q1 = np.percentile(salary_gt0['salary'],25)
salary_gt0_IQR = salary_gt0_Q3 - salary_gt0_Q1
salary_gt0_IQR_K = salary_gt0_IQR * 1.5

#Next, we want to see if there are any observations in the salary_gt0 that would be considered outliers
df_raw_salary_gt0_outliers = salary_gt0.loc[(salary_gt0['salary'] < (salary_gt0_Q1 - salary_gt0_IQR_K)) | (salary_gt0['salary'] > (salary_gt0_Q3 + salary_gt0_IQR_K))]
df_raw_salary_gt0_outliers

Unnamed: 0,sl_no,female,ssc_p,ssc_Central,hsc_p,hsc_p_outlier,hsc_Central,hsc_s,hsc_s_commerce,hsc_s_science,...,degree_t_comm_mgmt,degree_t_sci_tech,work_experience,etest_p,specialisation,specialisation_finance,specialisation_hr,mba_p,employed,salary
4,5,0,85.8,1,73.6,0,1,Commerce,1,0,...,1,0,0,96.8,Mkt&Fin,1,0,55.5,1,425000.0
21,22,1,79.0,0,76.0,0,0,Commerce,1,0,...,1,0,0,95.0,Mkt&Fin,1,0,69.06,1,393000.0
39,40,0,81.0,0,68.0,0,0,Science,0,1,...,0,1,0,93.0,Mkt&Fin,1,0,62.56,1,411000.0
53,54,0,80.0,0,70.0,0,0,Science,0,1,...,0,1,0,87.0,Mkt&HR,0,1,71.04,1,450000.0
77,78,0,64.0,0,80.0,0,0,Science,0,1,...,0,1,1,69.0,Mkt&Fin,1,0,57.65,1,500000.0
85,86,1,83.84,0,89.83,0,0,Commerce,1,0,...,1,0,1,78.74,Mkt&Fin,1,0,76.18,1,400000.0
95,96,0,73.0,1,78.0,0,0,Commerce,1,0,...,1,0,1,95.46,Mkt&Fin,1,0,62.16,1,420000.0
119,120,0,60.8,1,68.4,0,1,Commerce,1,0,...,1,0,1,82.66,Mkt&Fin,1,0,64.34,1,940000.0
128,129,0,80.4,1,73.4,0,1,Science,0,1,...,0,1,1,81.2,Mkt&HR,0,1,76.26,1,400000.0
145,146,0,89.4,0,65.66,0,0,Science,0,1,...,0,1,0,72.0,Mkt&HR,0,1,63.23,1,400000.0


We can see here that a fair number of observations are considered outliers. For now, we'll simply flag them as outliers and see if it makes sense to worry about them down the road. At least initially, none of the values appear to raise any concerns.
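
One way to build such a flag without a separate index lookup is to compute the quartiles on the placed (salary above zero) rows only, then combine the "is placed" mask with the fence test on the full column. The sketch below assumes that unplaced rows carry a salary of 0, as in this notebook. The toy salaries and the name `is_placed` are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the placement data set)
toy_df = pd.DataFrame({
    "salary": [0.0, 200000.0, 250000.0, 0.0, 260000.0, 270000.0, 300000.0, 940000.0],
})

# Quartiles come from placed students only, so the zeroes do not distort them
is_placed = toy_df["salary"] > 0
placed_salary = toy_df.loc[is_placed, "salary"]
q1 = placed_salary.quantile(0.25)
q3 = placed_salary.quantile(0.75)
fence = 1.5 * (q3 - q1)

# A row is flagged only if it is placed AND its salary is outside the fences
is_outlier = is_placed & (
    (toy_df["salary"] < q1 - fence) | (toy_df["salary"] > q3 + fence)
)
toy_df["salary_gt0_outliers"] = is_outlier.astype(int)
print(toy_df)
```

Because every mask shares the index of the full DataFrame, the zero-salary rows are automatically assigned a flag of 0.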

In [43]:
#Build a 0/1 flag that is 1 for every row whose index label appears among the salary outliers
salary_gt0_outliers = df_raw.index.isin(df_raw_salary_gt0_outliers.index).astype(int)

#We then add it to the data frame as a new column so we have the flag ready for use!
df_raw = df_raw.assign(salary_gt0_outliers=salary_gt0_outliers)
df_raw

Unnamed: 0,sl_no,female,ssc_p,ssc_Central,hsc_p,hsc_p_outlier,hsc_Central,hsc_s,hsc_s_commerce,hsc_s_science,...,degree_t_sci_tech,work_experience,etest_p,specialisation,specialisation_finance,specialisation_hr,mba_p,employed,salary,salary_gt0_outliers
0,1,0,67.00,0,91.00,0,0,Commerce,1,0,...,1,0,55.0,Mkt&HR,0,1,58.80,1,270000.0,0
1,2,0,79.33,1,78.33,0,0,Science,0,1,...,1,1,86.5,Mkt&Fin,1,0,66.28,1,200000.0,0
2,3,0,65.00,1,68.00,0,1,Arts,0,0,...,0,0,75.0,Mkt&Fin,1,0,57.80,1,250000.0,0
3,4,0,56.00,1,52.00,0,1,Science,0,1,...,1,0,66.0,Mkt&HR,0,1,59.43,0,0.0,0
4,5,0,85.80,1,73.60,0,1,Commerce,1,0,...,0,0,96.8,Mkt&Fin,1,0,55.50,1,425000.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
210,211,0,80.60,0,82.00,0,0,Commerce,1,0,...,0,0,91.0,Mkt&Fin,1,0,74.49,1,400000.0,1
211,212,0,58.00,0,60.00,0,0,Science,0,1,...,1,0,74.0,Mkt&Fin,1,0,53.62,1,275000.0,0
212,213,0,67.00,0,67.00,0,0,Commerce,1,0,...,0,1,59.0,Mkt&Fin,1,0,69.72,1,295000.0,0
213,214,1,74.00,0,66.00,0,0,Commerce,1,0,...,0,0,70.0,Mkt&HR,0,1,60.23,1,204000.0,0


In [44]:
#Finally, we export our newly processed DataFrame as a csv for future use!
df_raw.to_csv('Placement_Data_Full_Class_CLEANED')
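
A note on the export: by default `to_csv` also writes the DataFrame index, and reading such a file back with `read_csv` produces an extra column named "Unnamed: 0". Whether that is wanted depends on how the file is used later. The sketch below uses in-memory buffers and a made-up two-row DataFrame (both assumptions for illustration) to show the difference that `index=False` makes.

```python
import io

import pandas as pd

# Toy data (hypothetical values, not taken from the placement data set)
toy_df = pd.DataFrame({"sl_no": [1, 2], "salary": [270000.0, 0.0]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```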

Concluding Remarks:
For something that may seem relatively simple, data wrangling calls for *a lot* of careful consideration, as I'm learning, and that's with a relatively clean data set. That said, I feel like I learned a lot and want to become more efficient at parsing through data like this. Overall, I hope this hits the marks of what you're looking for!

Cheers!
Emre