# Prediction Model Environment/Template

### By Chuan Yong (Bill) Guo

This file is a template for testing various machine learning models and for documenting the options and parameters that can be adjusted for each model, in a concise and easy-to-follow format. Most lines of code are commented out with a leading '#' so that a single code cell can hold several alternatives, some of which cannot run together. To run a line, remove the '#' in front of it. This setup packages multiple concepts into a single code cell for easy access and understanding, without the cell raising errors from contradictory statements.

In [1]:
#This cell is used to import modules that will be useful for model prediction. 

import pandas as pd #Pandas is the base Python library used to manipulate datasets; it is almost always useful.
import numpy as np #NumPy is mainly used for mathematical manipulation of datasets; it is almost always useful.



In [2]:
#Importing datasets
df_original = pd.read_csv('Datasets/salarysample.csv') #Used to read .csv files.
#df2 = pd.read_excel('Datasets/Global Superstore.xls', sheet_name='People') #Used to read .xls files, sheet_name can be used to select for different sheets by using index number or string name.
df_original.head() #Initial dataset without any manipulation.

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,...,tensor,hadoop,tableau,bi,flink,mongo,google_an,job_title_sim,seniority_by_title,Degree
0,0,Data Scientist,$53K-$91K (Glassdoor est.),"Data Scientist\nLocation: Albuquerque, NM\nEdu...",3.8,Tecolote Research\n3.8,"Albuquerque, NM","Goleta, CA",501 - 1000,1973,...,0,0,1,1,0,0,0,data scientist,na,M
1,1,Healthcare Data Scientist,$63K-$112K (Glassdoor est.),What You Will Do:\n\nI. General Summary\n\nThe...,3.4,University of Maryland Medical System\n3.4,"Linthicum, MD","Baltimore, MD",10000+,1984,...,0,0,0,0,0,0,0,data scientist,na,M
2,2,Data Scientist,$80K-$90K (Glassdoor est.),"KnowBe4, Inc. is a high growth information sec...",4.8,KnowBe4\n4.8,"Clearwater, FL","Clearwater, FL",501 - 1000,2010,...,0,0,0,0,0,0,0,data scientist,na,M
3,3,Data Scientist,$56K-$97K (Glassdoor est.),*Organization and Job ID**\nJob ID: 310709\n\n...,3.8,PNNL\n3.8,"Richland, WA","Richland, WA",1001 - 5000,1965,...,0,0,0,0,0,0,0,data scientist,na,na
4,4,Data Scientist,$86K-$143K (Glassdoor est.),Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 - 200,1998,...,0,0,0,0,0,0,0,data scientist,na,na
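
The `Unnamed: 0` column in the preview above is the name pandas gives to a header cell that is empty, which typically happens when a CSV was saved together with its row index. Whether that is the case for `salarysample.csv` is an assumption. The sketch below uses a hypothetical in-memory CSV (`csv_text`) to show how `index_col=0` reads that first column as the index instead.

```python
import io
import pandas as pd

# Hypothetical CSV text saved with its row index, so the first header cell is empty.
csv_text = ",Job Title,Rating\n0,Data Scientist,3.8\n1,Healthcare Data Scientist,3.4\n"

# Default read: the unnamed first column becomes a regular data column.
default_read = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the row index instead.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))  # ['Unnamed: 0', 'Job Title', 'Rating']
print(list(indexed_read.columns))  # ['Job Title', 'Rating']
```

If the extra column is not wanted, passing `index_col=0` at read time is usually simpler than dropping it afterwards.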


### Basic Column and Row Manipulation

In [28]:
###'''Data Cleaning'''

df = df_original.copy() #Good idea to work on a copy of the original dataset so that the original isn't directly manipulated; plain assignment (df = df_original) would only create a second name for the same object.

###'''Declaring and resetting columns as index'''
#df.set_index('Job Title',inplace=True)
#df.reset_index(inplace=True)

###'''Adding a column based on other columns'''
#df['New column'] = (df['Lower Salary'] + df['Upper Salary'])/2

###'''Removing columns'''
#df.drop(columns=['Rating','Location'],inplace=True) #By column name.
#df.drop(columns=df.columns[1:3],inplace=True) #By column index with range of columns (columns= is equivalent to axis=1).

###'''Viewing multiple columns'''
#df[['New column','Job Title']] #By column name
#df.loc[:,'Rating':'Location'] #By column name, as a label range (both end labels are included)
#df.iloc[:,[1,2,5]] #By column index
#df.iloc[:,1:5] #By column index with range of columns

###'''Adding a row'''
#df.loc[len(df)] = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41] #Add new row at the end of dataframe, or any row by specifying index number, number of elements must match column number

###'''Removing rows'''
#df.drop([0,1]) #By index label; returns a new dataframe unless inplace=True is passed

###'''Viewing multiple rows'''
#df.iloc[3:9,1:5]



Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating
3,Data Scientist,$56K-$97K (Glassdoor est.),*Organization and Job ID**\nJob ID: 310709\n\n...,3.8
4,Data Scientist,$86K-$143K (Glassdoor est.),Data Scientist\nAffinity Solutions / Marketing...,2.9
5,Data Scientist,$71K-$119K (Glassdoor est.),CyrusOne is seeking a talented Data Scientist ...,3.4
6,Data Scientist,$54K-$93K (Glassdoor est.),Job Description\n\n**Please only local candida...,4.1
7,Data Scientist,$86K-$142K (Glassdoor est.),Advanced Analytics – Lead Data Scientist\nOver...,3.8
8,Research Scientist,$38K-$84K (Glassdoor est.),SUMMARY\n\nThe Research Scientist I will be ta...,3.3
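
The cell above recommends working on a copy of the original dataset. As a minimal sketch of why this matters, the example below contrasts plain assignment with `DataFrame.copy()` on hypothetical toy data (the frame `original` and the `Flag` column are illustrative and not part of the salary dataset).

```python
import pandas as pd

# Hypothetical toy data standing in for the salary dataset.
original = pd.DataFrame({
    "Job Title": ["Data Scientist", "Research Scientist"],
    "Rating": [3.8, 3.3],
})

alias = original               # Same object under a second name; nothing is copied.
independent = original.copy()  # New dataframe with its own data.

alias["Flag"] = 1                                   # Also adds the column to `original`.
independent.drop(columns=["Rating"], inplace=True)  # Leaves `original` untouched.

print(alias is original)          # True
print(list(original.columns))     # ['Job Title', 'Rating', 'Flag']
print(list(independent.columns))  # ['Job Title']
```

Changes made through `alias` show up in `original`, while changes to `independent` do not, which is the behavior a cleaning step usually wants.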


### Data values/strings Manipulation

In [30]:
###'''Visualizing datatypes and null values each column contains'''
#df.info()

###'''Imputation (replacing null data with a substitute value, or removing it completely)'''


<class 'pandas.core.frame.DataFrame'>
Int64Index: 744 entries, 0 to 743
Data columns (total 42 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   index               744 non-null    int64  
 1   Job Title           744 non-null    object 
 2   Salary Estimate     744 non-null    object 
 3   Job Description     744 non-null    object 
 4   Rating              744 non-null    float64
 5   Company Name        744 non-null    object 
 6   Location            744 non-null    object 
 7   Headquarters        744 non-null    object 
 8   Size                744 non-null    object 
 9   Founded             744 non-null    int64  
 10  Type of ownership   744 non-null    object 
 11  Industry            744 non-null    object 
 12  Sector              744 non-null    object 
 13  Revenue             744 non-null    object 
 14  Competitors         744 non-null    object 
 15  Hourly              744 non-null    int64  
 16  Employer
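
The imputation step in the cell above is left as a placeholder. A minimal sketch of the two options it names (removing nulls or replacing them) follows, using hypothetical toy data. The `df.info()` output lists no nulls in the columns shown, while the preview shows the string `na` in some columns; treating `na` as a missing-value marker is an assumption here, as are the choices of median and mode as fill values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the salary dataset.
toy = pd.DataFrame({
    "Rating": [3.8, np.nan, 4.8, 2.9],
    "Degree": ["M", "M", "na", "P"],
})

# Assumption: the string 'na' marks a missing value, so convert it to a real null.
toy["Degree"] = toy["Degree"].replace("na", np.nan)

print(toy.isna().sum())  # Number of nulls per column.

# Option 1: remove every row that contains at least one null.
dropped = toy.dropna()

# Option 2: replace nulls, using the median for numbers and the mode for text.
filled = toy.copy()
filled["Rating"] = filled["Rating"].fillna(filled["Rating"].median())
filled["Degree"] = filled["Degree"].fillna(filled["Degree"].mode()[0])

print(dropped)
print(filled)
```

Dropping rows is the simplest choice but discards data; filling keeps every row at the cost of introducing estimated values, so the better option depends on how many nulls the real dataset contains.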

### Splitting data for testing and training, test models