<h1>Regression Model Development</h1>

<em><strong>Roberto Zevallos</strong></em>

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

To start, I imported the birth weight Excel file and checked the size of the dataset: it is small, with 196 observations and 18 features.
<br>
Then the .info() method let me check the structure of the data, such as the data type of each column. Using the .isnull() method, I identified missing values in three features (father's education, number of prenatal visits and mother's education).
<br><br>

In [None]:
# importing libraries
import pandas as pd # data science essentials
import numpy as np # mathematical essentials
import matplotlib.pyplot as plt # essential graphical output
import seaborn as sns # enhanced graphical output
import statsmodels.formula.api as smf # predictive modeling with nice outputs

# setting file name
file = './birthweight_low.xlsx'

birthweight = pd.read_excel(io         = file,
                            header     = 0,
                            sheet_name = 0)

In [None]:
print(f"""
Size of Original Dataset
------------------------
Observations: {birthweight.shape[0]}
Features:     {birthweight.shape[1]}
""")

In [None]:
birthweight.info()

In [None]:
birthweight.isnull().sum().sort_values(ascending = False)

<br><br>
<strong>Flagging missing values</strong>
<br>
To mark the missing values I used a loop that adds a new flag column at the end of the dataset for each feature with missing data, holding "1" where a value was missing and "0" otherwise. These new columns are named by adding the prefix "m_" (for missing) to the beginning of the original feature name.
Then the flags were summed to validate that all the missing values were identified.
<br><br>

In [None]:
# flag the missing values
for column in birthweight:

    if birthweight[column].isnull().astype(int).sum() > 0:
        birthweight['m_' + column] = birthweight[column].isnull().astype(int)

In [None]:
# sum of missing values
print(f"""
Sums of Missing Value Flags
--------------------------
{birthweight.iloc[ : , -3: ].sum(axis = 0)}

""")

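The positional slice `iloc[ : , -3: ]` works here because the three flag columns happen to be the last ones. As a minimal sketch, using a small made-up DataFrame (`demo` is hypothetical and is not the birth weight data), the flags could instead be selected by their `m_` prefix, assuming no original column name starts with `m_`:

```python
import numpy as np
import pandas as pd

# hypothetical toy data, NOT the birth weight dataset
demo = pd.DataFrame({'meduc': [12.0, np.nan, 16.0, 14.0],
                     'npvis': [10.0, 12.0, np.nan, np.nan],
                     'mage' : [25, 31, 40, 28]})

# flag the missing values, following the same idea as the loop above
for column in list(demo.columns):
    if demo[column].isnull().any():
        demo['m_' + column] = demo[column].isnull().astype(int)

# select the flag columns by their 'm_' prefix rather than by position
flag_columns = [col for col in demo.columns if col.startswith('m_')]
flag_sums    = demo[flag_columns].sum(axis = 0)

# null counts of the flagged features, for comparison with the flag sums
null_counts = demo[[col[2:] for col in flag_columns]].isnull().sum()

print(flag_sums)
print(null_counts)
```

Selecting by name keeps the check valid even if more columns are appended to the DataFrame later.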
<br><br>
<strong>Treatment of missing values</strong>
<br>
To treat the missing values I created a new DataFrame where I dropped the rows containing the previously identified missing values. This was done to identify the mode and median of the complete cases and to have a baseline for comparison once I had imputed the missing values.
<br><br>
By looking at the histograms for each variable that had missing values, I decided that the best imputation method was the median. I did not use the mean because it produces decimal values, and the median is more consistent with the values already in the data. Also, imputing with the mean or with the median made little difference to the resulting distributions.
<br><br>
Finally, the missing values were imputed with the median using the .fillna() method.
<br><br>

In [None]:
# Dropping the rows with missing values and storing the result in a new DataFrame
birthweight_dropped = birthweight.dropna().copy()

In [None]:
birthweight_dropped.describe(include = 'number').round(2)

In [None]:
# histogram for feduc
sns.histplot(data  = birthweight_dropped,
             x     ='feduc',
             bins  = 'fd',
             kde   = True, # kernel density estimate, draws the smoothed distribution
             color = 'gray')


# adding a title
plt.title(label = "Distribution of father education")


# adding an x-label
plt.xlabel(xlabel = 'Father education')


# adding a y-label
plt.ylabel(ylabel = 'Frequency')

# adding a mean and median line
plt.axvline(x = birthweight_dropped['feduc'].mean(),
            color = 'red')

plt.axvline(x = birthweight_dropped['feduc'].median(),
            color = 'blue')

# adding legend
plt.legend(labels =  ['bins','mean', 'median'])

# displaying the plot
plt.tight_layout()
plt.show()

########
# histogram for meduc
sns.histplot(data  = birthweight_dropped,
             x     ='meduc',
             bins  = 'fd',
             kde   = True, # kernel density estimate, draws the smoothed distribution
             color = 'gray')


# adding a title
plt.title(label = "Distribution of mother education")


# adding an x-label
plt.xlabel(xlabel = 'Mother education')


# adding a y-label
plt.ylabel(ylabel = 'Frequency')

# adding a mean and median line
plt.axvline(x = birthweight_dropped['meduc'].mean(),
            color = 'red')

plt.axvline(x = birthweight_dropped['meduc'].median(),
            color = 'blue')

# adding legend
plt.legend(labels =  ['bins','mean', 'median'])

# displaying the plot
plt.tight_layout()
plt.show()

#######
# histogram for npvis
sns.histplot(data  = birthweight_dropped,
             x     ='npvis',
             bins  = 'fd',
             kde   = True, # kernel density estimate, draws the smoothed distribution
             color = 'gray')


# adding a title
plt.title(label = "Distribution of number of prenatal visits")


# adding an x-label
plt.xlabel(xlabel = 'Prenatal visits')


# adding a y-label
plt.ylabel(ylabel = 'Frequency')

# adding a mean and median line
plt.axvline(x = birthweight_dropped['npvis'].mean(),
            color = 'red')

plt.axvline(x = birthweight_dropped['npvis'].median(),
            color = 'blue')

# adding legend
plt.legend(labels =  ['bins','mean', 'median'])

# displaying the plot
plt.tight_layout()
plt.show()

In [None]:
meduc_median = birthweight['meduc'].median()


# filling meduc NAs with MEDIAN
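# assigning the result back, because inplace = True on a column selection
# may not update the DataFrame in newer pandas versions
birthweight['meduc'] = birthweight['meduc'].fillna(value = meduc_median)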


# checking to make sure NAs are filled in
print(birthweight['meduc'].isnull().any())


npvis_median = birthweight['npvis'].median()


# filling npvis NAs with MEDIAN
# assigning the result back instead of relying on inplace = True
birthweight['npvis'] = birthweight['npvis'].fillna(value = npvis_median)


# checking to make sure NAs are filled in
print(birthweight['npvis'].isnull().any())



feduc_median = birthweight['feduc'].median()


# filling feduc NAs with MEDIAN
# assigning the result back instead of relying on inplace = True
birthweight['feduc'] = birthweight['feduc'].fillna(value = feduc_median)


# checking to make sure NAs are filled in
print(birthweight['feduc'].isnull().any())

In [None]:
# validating the imputed values for feduc by comparing them with
# the mother's education and age in the same rows
birthweight.loc[birthweight['m_feduc'] == 1, ['meduc', 'feduc', 'mage']]

In [None]:
# checking that the imputation changed the summary statistics only slightly
birthweight.describe(include = 'number').round(2)
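The comparison between mean and median imputation described above can be sketched on a toy column. The values below are invented for illustration (they are not taken from the birth weight file), and the side-by-side table is just one possible way to lay out the check:

```python
import numpy as np
import pandas as pd

# hypothetical years-of-education values with two gaps (not the real data)
educ = pd.Series([12, 12, 12, 14, 16, 16, 17, np.nan, np.nan], dtype = float)

# candidate fill values
candidates = {'mean'  : educ.mean(),
              'median': educ.median()}

# summary statistics before and after each candidate imputation
comparison = pd.DataFrame({'original': educ.describe()})

for name, fill_value in candidates.items():
    comparison[name + '_imputed'] = educ.fillna(fill_value).describe()

print(comparison.round(2))
```

If the two imputed columns are nearly identical, as the text reports for this dataset, the choice can be made on interpretability alone.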

<br><br>
<strong>Validating features and target variable distributions</strong>
<br>
This section uses histograms to examine the distribution of each feature and of the response variable, to judge whether a log transformation could be useful during feature engineering.
<br><br>
Only a few histograms were left in the final code to keep it simple, but the arguments of the histograms were modified in this section to look at the different variables before and after the engineering process. 
<br><br>

In [None]:
# birthweight
sns.histplot(data  = birthweight,
             x     = 'bwght',
             kde    = True)


# title and axis labels
plt.title(label   = "Original Distribution of Birthweight")
plt.xlabel(xlabel = "Birthweight") # avoiding using dataset labels
plt.ylabel(ylabel = "Count")

# displaying the histogram
plt.show()

# cigs
sns.histplot(data  = birthweight,
             x     = 'cigs',
             kde    = True)


# title and axis labels
plt.title(label   = "Original Distribution of Avg Cigarettes per week")
plt.xlabel(xlabel = "Avg Cigarettes per week") # avoiding using dataset labels
plt.ylabel(ylabel = "Count")


# displaying the histogram
plt.show()

# drinks
sns.histplot(data  = birthweight,
             x     = 'drink',
             kde    = True)


# title and axis labels
plt.title(label   = "Original Distribution of Avg Drinks per week")
plt.xlabel(xlabel = "Avg Drinks per week") # avoiding using dataset labels
plt.ylabel(ylabel = "Count")


# displaying the histogram
plt.show()


# mage
sns.histplot(data  = birthweight,
             x     = 'mage',
             kde    = True)


# title and axis labels
plt.title(label   = "Original Distribution of Mother's age")
plt.xlabel(xlabel = "Mother's age") # avoiding using dataset labels
plt.ylabel(ylabel = "Count")


# displaying the histogram
plt.show()


# fage
sns.histplot(data  = birthweight,
             x     = 'fage',
             kde    = True)


# title and axis labels
plt.title(label   = "Original Distribution of Father's age")
plt.xlabel(xlabel = "Father's age") # avoiding using dataset labels
plt.ylabel(ylabel = "Count")


# displaying the histogram
plt.show()
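The histograms above are judged by eye. As a numeric complement, the sample skewness of each column can be printed with `DataFrame.skew()`. This is only a sketch on simulated columns (the names `symmetric` and `skewed` and the data are hypothetical); values near zero suggest a roughly symmetric distribution, while large positive values indicate a long right tail where a log transformation may be worth trying:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed = 219)

# hypothetical data: one roughly symmetric column, one right-skewed column
demo = pd.DataFrame({'symmetric': rng.normal(loc = 30, scale = 5, size = 500),
                     'skewed'   : rng.exponential(scale = 3, size = 500)})

# sample skewness per column, most skewed first
skewness = demo.skew().sort_values(ascending = False)

print(skewness.round(2))
```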

<br><br>
<strong>Correlations</strong>
<br>
Several correlation matrices were developed in this section, each fixing a single variable so that its correlation with the other variables is easy to compare.
<br>
In the example that is kept in the code, I can see that the variables drink, cigs, mage and fage could probably lead to a better model. However, it was also important to check the correlations among these variables themselves to identify possible multicollinearity.
<br><br>

In [None]:
# developing a correlation matrix
birthweight_corr = birthweight.corr(method = 'pearson')

# taking the absolute value to make it easier to identify the strongest
# correlations with bwght (regardless of their sign)
birthweight_corr.loc[ : , 'bwght'].round(decimals = 2).abs().sort_values(ascending = False)
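To look for possible multicollinearity, the correlations among the predictors themselves can be ranked. The sketch below runs on simulated data in which `fage` is deliberately built to track `mage`; the data and the 0.8 cut-off are assumptions for illustration, not values from this analysis:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed = 219)

# hypothetical predictors: 'fage' is constructed to follow 'mage' closely
mage = rng.normal(loc = 30, scale = 6, size = 200)
demo = pd.DataFrame({'mage' : mage,
                     'fage' : mage + rng.normal(loc = 2, scale = 2, size = 200),
                     'drink': rng.poisson(lam = 3, size = 200)})

# absolute pairwise correlations among the candidate predictors
corr = demo.corr(method = 'pearson').abs()

# keep each pair once (upper triangle, excluding the diagonal)
mask  = np.triu(np.ones(corr.shape, dtype = bool), k = 1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending = False)

# 0.8 is an arbitrary illustrative cut-off
print(pairs[pairs > 0.8])
```

Pairs that appear in the filtered output are candidates for keeping only one of the two features, or for combining them into a single engineered feature.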

<br><br>
<strong>Scatter plots</strong>
<br>
Several scatter plots were developed in this section to identify whether or not the different variables are related. To make it easy to produce many scatter plots, I defined a function that sets the figure size and displays the plot, so that plt.show() does not have to be called every time.
<br>
In the code I kept only some examples of the scatter plots, to keep the code clean and easy to read.
<br><br>

In [None]:
# defining a function for scatter plots
def scatterplots(var, response, data):
    """
    Generates a scatter plot of a continuous variable against the response.
    Make sure matplotlib.pyplot and seaborn have been imported (as plt and sns).

    PARAMETERS
    ----------
    var      : str, continuous variable
    response : str, response variable
    data     : DataFrame of the response and variables
    """

    fig, ax = plt.subplots(figsize = (8, 5))
    
    sns.scatterplot(x    = var,
                    y    = response,
                    data = data)
    
    plt.suptitle("")
    plt.show()

In [None]:
# cigs
scatterplots(response = 'bwght',
			 var      = 'cigs',
			 data     = birthweight)


# drinks
scatterplots(response = 'bwght',
			 var      = 'drink',
			 data     = birthweight)


# mage
scatterplots(response = 'bwght',
			 var      = 'mage',
			 data     = birthweight)


# fage
scatterplots(response = 'bwght',
			 var      = 'fage',
			 data     = birthweight)

# fmaps
scatterplots(response = 'bwght',
			 var      = 'fmaps',
			 data     = birthweight)

In [None]:
# mage only vs bwght
scatterplots(response = 'bwght',
			 var      = 'mage',
			 data     = birthweight)

# drinks only vs bwght
scatterplots(response = 'bwght',
			 var      = 'drink',
			 data     = birthweight)

# cigs only vs bwght
scatterplots(response = 'bwght',
			 var      = 'cigs',
			 data     = birthweight)

# cigs only vs mwhte (better try box plot)
scatterplots(response = 'cigs',
			 var      = 'mwhte',
			 data     = birthweight)

<br><br>
<strong>Box plots</strong>
<br>
Several box plots were developed in this section to identify whether or not the different categorical variables are related. For these types of variables, box plots give a better visualization than scatter plots. To make it easy to produce many box plots, I defined a function that sets the figure size and displays the plot, so that plt.show() does not have to be called every time.
<br>
In the code I kept only some examples of the box plots, to keep the code clean and easy to read.
<br><br>

In [None]:
# defining a function for categorical boxplots
def categorical_boxplots(cat_var, response, data):
    """
    Generates a box plot of the response for each level of a categorical variable.
    Make sure matplotlib.pyplot and seaborn have been imported (as plt and sns).

    PARAMETERS
    ----------
    cat_var  : str, categorical variable
    response : str, response variable
    data     : DataFrame of the response and categorical variables
    """

    fig, ax = plt.subplots(figsize = (10, 8))
    
    sns.boxplot(x    = cat_var,
                y    = response,
                data = data)
    
    plt.suptitle("")
    plt.show()

In [None]:
# cigs
categorical_boxplots(response = 'bwght',
					 cat_var  = 'cigs',
					 data     = birthweight)

# drinks
categorical_boxplots(response = 'bwght',
					 cat_var  = 'drink',
					 data     = birthweight)

# cigs vs races
categorical_boxplots(response = 'cigs',
					 cat_var  = 'fblck',
					 data     = birthweight)

# drinks vs races
categorical_boxplots(response = 'drink',
					 cat_var  = 'fblck',
					 data     = birthweight)

<br><br>
<strong>Feature Engineering</strong>
<br>
The first round of feature engineering applied a log transformation to the response variable and to the features that showed the highest correlation with it. For each one I also printed the correlations of the original and the log-transformed variables, to determine which ones I would expect to have a bigger impact on my model.
<br><br>

In [None]:
# creating the log of my variables
birthweight['log_bwght'] = np.log(birthweight['bwght'] + 0.0001)
birthweight['log_cigs'] = np.log(birthweight['cigs'] + 0.0001)

# selecting variables to get the correlation
log_corr = birthweight.loc[ : , ['cigs',
                             'log_cigs',
                             'bwght',
                             'log_bwght']  ].corr(method = 'pearson')\
                                                 .round(decimals = 2)
# getting the correlation
log_corr.loc[ ['cigs', 'log_cigs'],
              ['bwght', 'log_bwght']]

In [None]:
# creating the log of my variables
birthweight['log_drinks'] = np.log(birthweight['drink'] + 0.0001)

# selecting variables to get the correlation
log_corr = birthweight.loc[ : , ['drink',
                             'log_drinks',
                             'bwght',
                             'log_bwght']  ].corr(method = 'pearson')\
                                                 .round(decimals = 2)
# getting the correlation
log_corr.loc[ ['drink', 'log_drinks'],

             ['bwght', 'log_bwght']]

In [None]:
# creating the log of my variables
birthweight['log_mage'] = np.log(birthweight['mage'])

# selecting variables to get the correlation
log_corr = birthweight.loc[ : , ['mage',
                             'log_mage',
                             'bwght',
                             'log_bwght']  ].corr(method = 'pearson')\
                                                 .round(decimals = 2)
# getting the correlation
log_corr.loc[ ['mage', 'log_mage'],

             ['bwght', 'log_bwght']]

In [None]:
# creating the log of my variables
birthweight['log_fage'] = np.log(birthweight['fage'])

# selecting variables to get the correlation
log_corr = birthweight.loc[ : , ['fage',
                             'log_fage',
                             'bwght',
                             'log_bwght']  ].corr(method = 'pearson')\
                                                 .round(decimals = 2)
# getting the correlation
log_corr.loc[ ['fage', 'log_fage'],

             ['bwght', 'log_bwght']]
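The logs of `cigs` and `drink` above add 0.0001 so that zeros do not produce negative infinity. With that shift, every zero lands near -9.2 (the natural log of 0.0001), far from the rest of the values. One common alternative is `np.log1p`, which computes log(1 + x) and maps zero to exactly zero. The sketch below compares the two on invented counts; it is not the approach used in this notebook, only an option to test:

```python
import numpy as np
import pandas as pd

# hypothetical weekly counts with several zeros (not the real data)
cigs = pd.Series([0, 0, 0, 2, 5, 10, 20], dtype = float)

shifted = np.log(cigs + 0.0001)  # approach used in the cells above
log1p   = np.log1p(cigs)         # log(1 + x), exactly 0 when x is 0

comparison = pd.DataFrame({'cigs'       : cigs,
                           'log_shifted': shifted,
                           'log1p'      : log1p})

print(comparison.round(2))
```

Which version correlates better with the response is an empirical question, so both could be passed through the same correlation check used above.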

<br><br>
<strong>Feature Engineering continued</strong>
<br>
Later, after returning to the histograms, scatter plots, box plots and correlation matrices, I began to produce further engineered variables, which are explained in the comments of the following cell.
<br>
These newly engineered variables were also evaluated using the previously mentioned techniques; only some examples are kept in the final code to keep it simple and organized.
<br><br>

In [None]:
# creating the log of omaps and fmaps to check whether other features can
# predict these variables well (with or without the log), because they are
# correlated with the birthweight
birthweight['log_omaps'] = np.log(birthweight['omaps'])
birthweight['log_fmaps'] = np.log(birthweight['fmaps'])

# creating log of these variables to see if the correlation with birthweight
# improves and to determine if their distribution is closer to the normal
birthweight['log_meduc'] = np.log(birthweight['meduc'])
birthweight['log_monpre'] = np.log(birthweight['monpre'])
birthweight['log_npvis'] = np.log(birthweight['npvis'])
birthweight['log_feduc'] = np.log(birthweight['feduc'])

# creating the square of the mother's age to test it in a model alongside the
# original age; it didn't produce the expected results in the end
birthweight['mage_2'] = birthweight['mage'] * birthweight['mage']

# creating an interaction of both parents' ages to test if this interaction
# relates to healthier babies, as the parents could be more responsible
birthweight['mage_fage'] = birthweight['mage'] * birthweight['fage']

# creating a flag for mothers aged 40 or older to identify an effect on the response variable
birthweight['mage_over40'] = np.where(birthweight['mage'] >= 40, 1, 0)

# creating interactions with the flag for mothers aged 40 or older
# these interactions are to identify if the gender of the baby interacts with the age,
# and if the number of cigarettes or drinks has a different effect depending on
# the age of the mother
birthweight['mage_over40_male'] = birthweight['mage_over40'] * birthweight['male']
birthweight['mage_over40_cigs'] = birthweight['mage_over40'] * birthweight['cigs']
birthweight['mage_over40_drinks'] = birthweight['mage_over40'] * birthweight['drink']

# creating variables to determine if any relation between the first visit to the doctor
# and the number of times the mother goes to the doctor during pregnancy has a good
# result in predicting the birthweight
birthweight['monpre_by_npvis'] = birthweight['monpre'] / birthweight['npvis']
birthweight['npvis_by_monpre'] = birthweight['npvis'] / birthweight['monpre']
birthweight['npvis_monpre'] = birthweight['npvis'] * birthweight['monpre']

# creating a variable (the sum of both parents' education) to test if it relates
# to healthier babies, as the parents could be more responsible
birthweight['both_education'] = birthweight['meduc'] + birthweight['feduc']

# creating variables to test if there's an increased effect on birthweight related
# to the interaction between mother's age and the number of cigs or drinks
birthweight['mage_drinks'] = birthweight['mage'] * birthweight['drink']
birthweight['mage_cigs'] = birthweight['mage'] * birthweight['cigs']
birthweight['log_mage_drinks'] = np.log(birthweight['mage_drinks'] + 0.0001)
birthweight['log_mage_cigs'] = np.log(birthweight['mage_cigs'] + 0.0001)
birthweight['mage_plus_drinks'] = birthweight['mage'] + birthweight['drink']
birthweight['mage_plus_cigs'] = birthweight['mage'] + birthweight['cigs']
birthweight['mage_by_drinks'] = birthweight['mage'] / (birthweight['drink'] + 0.0001)
birthweight['drinks_by_mage'] = birthweight['drink'] / birthweight['mage']
birthweight['log_mage_by_drinks'] = np.log(birthweight['mage_by_drinks'])
birthweight['log_drinks_by_mage'] = np.log(birthweight['drinks_by_mage'] + 0.0001)
birthweight['mage_by_cigs'] = birthweight['mage'] / (birthweight['cigs'] + 0.0001)
birthweight['cigs_by_mage'] = birthweight['cigs'] / birthweight['mage']
birthweight['log_mage_by_cigs'] = np.log(birthweight['mage_by_cigs'])
birthweight['log_cigs_by_mage'] = np.log(birthweight['cigs_by_mage'] + 0.0001)
birthweight['mage_by_drinks_cigs'] = birthweight['mage'] / (birthweight['drink'] + birthweight['cigs'] + 0.0001)
birthweight['drinks_cigs_by_mage'] = (birthweight['drink'] + birthweight['cigs']) / birthweight['mage']
birthweight['log_mage_by_drinks_cigs'] = np.log(birthweight['mage_by_drinks_cigs'])
birthweight['log_drinks_cigs_by_mage'] = np.log(birthweight['drinks_cigs_by_mage'] + 0.0001)

# creating variables to identify if drinks and cigs combined produce a better model
birthweight['drinks_plus_cigs'] = birthweight['drink'] + birthweight['cigs']
birthweight['drinks_cigs'] = birthweight['drink'] * birthweight['cigs']

# creating variables to determine if there's an interaction between mother's age
# and cigs or drinks, scaled by the maximum number of cigs and drinks in the
# dataset. In the end these variables produced the best model, but they are not
# really useful as predictors because the maximum number of cigs or drinks is
# unknown for new independent data, and in a completely different dataset they
# could turn out not to be significant
birthweight['mage_by_top_cigs'] = birthweight['mage'] / (birthweight['cigs'].max() + 1 - birthweight['cigs'])
birthweight['mage_by_top_drinks'] = birthweight['mage'] / (birthweight['drink'].max() + 1 - birthweight['drink'])

# creating variables to identify if having both parents of the same race has an effect
# on birthweight prediction
birthweight['both_race_w'] = birthweight['mwhte'] * birthweight['fwhte']
birthweight['both_race_b'] = birthweight['mblck'] * birthweight['fblck']
birthweight['both_race_oth'] = birthweight['moth'] * birthweight['foth']

# creating variables to identify if having parents of different races has
# an effect on birthweight prediction
birthweight['both_race_mw_fb'] = birthweight['mwhte'] * birthweight['fblck']
birthweight['both_race_mw_fo'] = birthweight['mwhte'] * birthweight['foth']
birthweight['both_race_mb_fo'] = birthweight['mblck'] * birthweight['foth']
birthweight['both_race_fw_mb'] = birthweight['fwhte'] * birthweight['mblck']
birthweight['both_race_fw_mo'] = birthweight['fwhte'] * birthweight['moth']
birthweight['both_race_fb_mo'] = birthweight['fblck'] * birthweight['moth']
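The comment on `mage_by_top_cigs` points out that the maximum number of cigarettes would not be known for new data. One possible way around this, sketched below on invented values, is to learn the maximum once from the training data and reuse that stored value for any later observations. The helper `add_mage_by_top` and both DataFrames are hypothetical names introduced only for this illustration:

```python
import pandas as pd

def add_mage_by_top(data, column, top_value):
    """Hypothetical helper: mage / (top_value + 1 - column), with top_value fixed beforehand."""
    return data['mage'] / (top_value + 1 - data[column])

# hypothetical training and new observations (not the real data)
train    = pd.DataFrame({'mage': [22, 30, 41], 'cigs': [0, 10, 25]})
new_data = pd.DataFrame({'mage': [35],         'cigs': [5]})

# learn the maximum once, from the training data only
top_cigs = train['cigs'].max()

# apply the same stored maximum to both datasets
train['mage_by_top_cigs']    = add_mage_by_top(train, 'cigs', top_cigs)
new_data['mage_by_top_cigs'] = add_mage_by_top(new_data, 'cigs', top_cigs)

print(new_data)
```

A new observation above the stored maximum would shrink the denominator to zero or below, so such values would need to be clipped or handled separately before this feature could be trusted on new data.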

<br><br>
<strong>Analysis of engineered variables</strong>
<br>
The techniques used were value counts to evaluate the resulting sample sizes, histograms to look at the distributions, correlation matrices to identify the strongest correlations or multicollinearity, and scatter plots and box plots to visualize relationships between variables and possible categories or interactions that could be created next.
<br>
The code for these charts and for the engineered variables was run multiple times with different arguments, creating many different categories and variables.
<br>
These cells were also rerun while developing the models, to keep visualizing and validating the relationships and to engineer new variables or categories.
<br><br>

In [None]:
# value counts used for the different categories created, to evaluate the resulting sample sizes
birthweight['both_race_w'].value_counts()

In [None]:
# developing correlation matrices for specific variables
birthweight_corr = birthweight.corr(method = 'pearson')


birthweight_corr.loc[ ['bwght', 'log_bwght', 'feduc', 'mwhte',
                       'fwhte', 'meduc', 'mage', 'moth', 'foth'],
                      ['bwght', 'log_bwght', 'feduc', 'mwhte',
                       'fwhte','meduc', 'mage', 'moth', 'foth']].round(decimals = 2)

In [None]:
# developing other correlation matrix for specific variables
birthweight_corr = birthweight.corr(method = 'pearson')


birthweight_corr.loc[ ['bwght', 'log_bwght', 'drink', 'mage_drinks', 'log_mage_drinks', 'cigs', 'mage_cigs', 'log_mage_cigs', 'mage_plus_drinks', 'mage_plus_cigs'],
                      ['bwght', 'log_bwght', 'drink', 'mage_drinks', 'log_mage_drinks', 'cigs', 'mage_cigs', 'log_mage_cigs', 'mage_plus_drinks', 'mage_plus_cigs']].round(decimals = 2)

In [None]:
# developing other correlation matrix for specific variables
birthweight_corr = birthweight.corr(method = 'pearson')

birthweight_corr.loc[ ['bwght', 'log_bwght', 'mage_by_drinks', 'drinks_by_mage',
                       'log_mage_by_drinks', 'log_drinks_by_mage', 'mage_by_cigs',
                       'cigs_by_mage', 'log_mage_by_cigs', 'log_cigs_by_mage',
                       'mage_by_drinks_cigs', 'drinks_cigs_by_mage',
                       'log_mage_by_drinks_cigs', 'log_drinks_cigs_by_mage'],
                      ['bwght', 'log_bwght', 'mage_by_drinks', 'drinks_by_mage',
                       'log_mage_by_drinks', 'log_drinks_by_mage', 'mage_by_cigs',
                       'cigs_by_mage', 'log_mage_by_cigs', 'log_cigs_by_mage',
                       'mage_by_drinks_cigs', 'drinks_cigs_by_mage',
                       'log_mage_by_drinks_cigs', 'log_drinks_cigs_by_mage']].round(decimals = 2).abs().sort_values(ascending = False, by = 'bwght')

In [None]:
# fage vs log
scatterplots(response = 'log_bwght',
			 var      = 'fage',
			 data     = birthweight)

# mage vs log
scatterplots(response = 'log_bwght',
			 var      = 'mage',
			 data     = birthweight)

In [None]:
########################### drinks
# mage and drinks
scatterplots(response = 'bwght',
			 var      = 'mage_drinks',
			 data     = birthweight)

# mage and drinks vs log
scatterplots(response = 'log_bwght',
			 var      = 'mage_drinks',
			 data     = birthweight)

# mage only
scatterplots(response = 'bwght',
			 var      = 'mage',
			 data     = birthweight)

# drinks only
scatterplots(response = 'bwght',
			 var      = 'drink',
			 data     = birthweight)


######################### cigs

# mage and cigs
scatterplots(response = 'bwght',
			 var      = 'mage_cigs',
			 data     = birthweight)

# mage and cigs vs log
scatterplots(response = 'log_bwght',
			 var      = 'mage_cigs',
			 data     = birthweight)

# mage only
scatterplots(response = 'bwght',
			 var      = 'mage',
			 data     = birthweight)

# cigs only
scatterplots(response = 'bwght',
			 var      = 'cigs',
			 data     = birthweight)

In [None]:
# mage_drinks
sns.histplot(data  = birthweight,
             x     = 'mage_drinks',
             kde    = True)


# title and axis labels
plt.title(label   = "Original Distribution of Mothers age interaction with number of drinks")
plt.xlabel(xlabel = "Mothers age interaction with number of drinks")
plt.ylabel(ylabel = "Count")


# displaying the histogram
plt.show()


# log_mage_drinks
sns.histplot(data  = birthweight,
             x     = 'log_mage_drinks',
             kde    = True)


# title and axis labels
plt.title(label   = "Log Distribution of Mothers age interaction with number of drinks")
plt.xlabel(xlabel = "Log of Mothers age interaction with number of drinks")
plt.ylabel(ylabel = "Count")


# displaying the histogram
plt.show()


# mage_cigs
sns.histplot(data  = birthweight,
             x     = 'mage_cigs',
             kde    = True)


# title and axis labels
plt.title(label   = "Original Distribution of Mothers age interaction with number of cigs")
plt.xlabel(xlabel = "Mothers age interaction with number of cigs") # avoiding using dataset labels
plt.ylabel(ylabel = "Count")


# displaying the histogram
plt.show()

# log_mage_cigs
sns.histplot(data  = birthweight,
             x     = 'log_mage_cigs',
             kde    = True)


# title and axis labels
plt.title(label   = "Log Distribution of Mothers age interaction with number of cigs")
plt.xlabel(xlabel = "Log of Mothers age interaction with number of cigs") # avoiding using dataset labels
plt.ylabel(ylabel = "Count")


# displaying the histogram
plt.show()

In [None]:
# cigs
categorical_boxplots(response = 'bwght',
					 cat_var  = 'cigs',
					 data     = birthweight)


# cigs
categorical_boxplots(response = 'log_bwght',
					 cat_var  = 'cigs',
					 data     = birthweight)


# cigs
categorical_boxplots(response = 'log_bwght',
					 cat_var  = 'log_cigs',
					 data     = birthweight)


##############################################################
# drinks
categorical_boxplots(response = 'bwght',
					 cat_var  = 'drink',
					 data     = birthweight)

# drinks
categorical_boxplots(response = 'log_bwght',
					 cat_var  = 'drink',
					 data     = birthweight)

# drinks
categorical_boxplots(response = 'log_bwght',
					 cat_var  = 'log_drinks',
					 data     = birthweight)

########################################################
# mothers race and educ 

# mwhte
categorical_boxplots(response = 'meduc',
					 cat_var  = 'mwhte',
					 data     = birthweight)

# mblck
categorical_boxplots(response = 'meduc',
					 cat_var  = 'mblck',
					 data     = birthweight)

# moth
categorical_boxplots(response = 'meduc',
					 cat_var  = 'moth',
					 data     = birthweight)

###################################
# fathers race and educ 

# fwhte
categorical_boxplots(response = 'feduc',
					 cat_var  = 'fwhte',
					 data     = birthweight)

# fblck
categorical_boxplots(response = 'feduc',
					 cat_var  = 'fblck',
					 data     = birthweight)

# foth
categorical_boxplots(response = 'feduc',
					 cat_var  = 'foth',
					 data     = birthweight)

<br><br>
<strong>First OLS model</strong>
<br>
In the following section I fitted a linear regression using different combinations of original and engineered features, to examine the p-values and to find ideas for new variables to develop.
<br>
As mentioned before, this was an ongoing process: I kept coming back to this section while creating and validating new variables and interactions.
<br><br>

In [None]:
# printing the column names separated by ' + ', the format required by the OLS formula
for col in birthweight.columns:
    print(col, end = ' + ')
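The loop above prints the names with a trailing ` + ` that has to be trimmed by hand. A small sketch of building the whole formula string with `str.join` instead is shown below; the column list is a hypothetical subset used only for illustration:

```python
# hypothetical subset of column names, for illustration only
columns  = ['bwght', 'log_bwght', 'cigs', 'drink', 'drinks_cigs']
response = 'bwght'

# exclude the response and its log so they never appear as predictors
excluded   = {response, 'log_' + response}
predictors = [col for col in columns if col not in excluded]

# joining with ' + ' leaves no trailing operator to delete by hand
formula = response + ' ~ ' + ' + '.join(predictors)

print(formula)
```

The resulting string can be passed as the `formula` argument of `smf.ols`, in the same way as the hand-written formula in the next cell.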

In [None]:
# INSTANTIATE the model object
lm_best = smf.ols(formula =  """bwght ~
cigs +
drink +
drinks_cigs""",
                                data = birthweight)

# FIT the data into the model object
results = lm_best.fit()

# SUMMARY output
print(results.summary())

<br><br>
<strong>Regression models</strong>
<br>
In the following sections I built several regression models using OLS, Lasso and ARD. I used a train/test split to divide my dataset, so that I could train on different sets of features and then test them with the three models.
<br>
As mentioned before, this was an ongoing process: I kept coming back to this section while creating and validating new variables and interactions.
<br><br>

In [None]:
# Linear regression

from sklearn.model_selection import train_test_split
# preparing the features by dropping the response variable and its log that was created before
birthweight_data = birthweight.drop(["bwght",
                                     "log_bwght"],
                                     axis = 1)

# preparing response variables, original and log
birthweight_target = birthweight.loc[ : , "bwght"]
log_birthweight_target = birthweight.loc[ : , "log_bwght"]


# preparing training and testing sets
x_train, x_test, y_train, y_test = train_test_split(
            birthweight_data,
            birthweight_target,
            test_size = 0.25,
            random_state = 219)


# checking the shapes of the datasets
print(f"""
Training Data
-------------
X-side: {x_train.shape}
y-side: {y_train.shape}


Testing Data
------------
X-side: {x_test.shape}
y-side: {y_test.shape}
""")

<br><br>
The next cell sets the x_variables used in my different regression models. It was constantly modified to try different features and obtain the best possible model.
<br>
As mentioned before, this was an ongoing process: I kept coming back to this section while creating and validating new variables and interactions.
<br>

In [None]:
# declaring set of x-variables
#x_variables = ['meduc', 'fage', 'male', 'mwhte', 'moth', 'm_meduc', 'm_npvis', 'm_feduc', 'fmaps', 'log_omaps', 'log_monpre', 'log_npvis', 'log_feduc', 'mage_drinks', 'mage_cigs', 'mage_plus_drinks', 'mage_plus_cigs', 'drinks_by_mage', 'log_mage_by_drinks']

#x_variables = ['meduc', 'fage', 'male', 'mwhte', 'moth', 'm_meduc', 'm_npvis', 'm_feduc', 'fmaps', 'log_omaps', 'log_monpre', 'log_npvis', 'log_feduc', 'mage_drinks', 'mage_cigs', 'mage_plus_drinks', 'drinks_by_mage', 'log_mage_by_drinks']

#x_variables = ['meduc', 'fage', 'male', 'mwhte', 'moth', 'm_meduc', 'm_npvis', 'm_feduc', 'fmaps', 'log_omaps', 'log_monpre', 'log_npvis', 'log_feduc', 'mage_cigs', 'mage_plus_drinks', 'drinks_by_mage', 'log_mage_by_drinks']

#x_variables = ['meduc', 'fage', 'male', 'mwhte', 'moth', 'm_meduc', 'm_npvis', 'm_feduc', 'fmaps', 'log_monpre', 'log_npvis', 'log_feduc', 'mage_cigs', 'mage_plus_drinks', 'drinks_by_mage', 'log_mage_by_drinks']

#x_variables = ['meduc', 'male', 'mwhte', 'moth', 'm_meduc', 'm_npvis', 'm_feduc', 'fmaps', 'log_monpre', 'log_npvis', 'log_feduc', 'mage_cigs', 'mage_plus_drinks', 'drinks_by_mage', 'log_mage_by_drinks']

#x_variables = ['meduc', 'male', 'moth', 'm_meduc', 'm_npvis', 'm_feduc', 'fmaps', 'log_monpre', 'log_npvis', 'log_feduc', 'mage_cigs', 'mage_plus_drinks', 'drinks_by_mage', 'log_mage_by_drinks']

#x_variables = ['meduc', 'male', 'moth', 'm_meduc', 'm_npvis', 'm_feduc', 'fmaps', 'log_npvis', 'log_feduc', 'mage_cigs', 'mage_plus_drinks', 'drinks_by_mage', 'log_mage_by_drinks']

#x_variables = ['meduc', 'male', 'moth', 'm_meduc', 'm_npvis', 'm_feduc', 'log_npvis', 'log_feduc', 'mage_cigs', 'mage_plus_drinks', 'drinks_by_mage', 'log_mage_by_drinks']

#x_variables = ['meduc', 'male', 'moth', 'm_meduc', 'm_npvis', 'log_npvis', 'log_feduc', 'mage_cigs', 'mage_plus_drinks', 'drinks_by_mage', 'log_mage_by_drinks']

#x_variables = ['meduc', 'moth', 'm_meduc', 'm_npvis', 'log_feduc', 'mage_cigs', 'mage_plus_drinks', 'drinks_by_mage', 'log_mage_by_drinks']

#x_variables = ['meduc', 'moth', 'm_npvis', 'log_feduc', 'mage_cigs', 'mage_plus_drinks', 'drinks_by_mage', 'log_mage_by_drinks']

#x_variables = ['meduc', 'm_npvis', 'log_feduc', 'mage_cigs', 'mage_plus_drinks', 'drinks_by_mage', 'log_mage_by_drinks']

#x_variables = ['meduc', 'm_npvis', 'mage_cigs', 'mage_plus_drinks', 'drinks_by_mage', 'log_mage_by_drinks']

#x_variables = ['meduc', 'mage_cigs', 'mage_plus_drinks', 'drinks_by_mage', 'log_mage_by_drinks']

#x_variables = ['meduc', 'mage_cigs', 'mage_plus_drinks', 'drinks_by_mage']

#x_variables = ['mage_cigs', 'mage_plus_drinks', 'drinks_by_mage', 'log_meduc']

# best ARD x_variables = ['cigs', 'drink', 'both_race_w', 'both_race_b', 'mage_by_top_cigs', 'meduc']

# equal best ARD x_variables = ['cigs', 'drink', 'mage_by_top_cigs', 'meduc']

# best OLS x_variables = ['cigs', 'drink', 'both_race_w', 'both_race_b', 'mage_by_top_cigs']

# second best OLS x_variables = ['cigs', 'drink', 'both_race_w', 'mage_by_top_cigs'] 

# second best ARD x_variables = ['cigs', 'drink', 'mage_by_top_cigs', 'feduc']

# also good model - x_variables = ['cigs', 'drink', 'meduc']

# also good model x_variables = ['cigs', 'drink', 'mage_by_top_cigs']

# also good model x_variables = ['cigs', 'drink', 'both_race_w', 'meduc']

x_variables = ['cigs', 'drink', 'both_race_w', 'both_race_b', 'meduc']

# looping to make x-variables suitable for statsmodels
for val in x_variables:
    print(f"{val} +")

In [None]:
# merging X_train and y_train so that they can be used in statsmodels
birthweight_train = pd.concat([x_train, y_train], axis = 1)

#building model
lm_best = smf.ols(formula =  """bwght ~
cigs +
drink +
both_race_w +
both_race_b +
meduc""",
data = birthweight_train)


#fitting model based on the data
results = lm_best.fit()



#printing summary of the model
print(results.summary())

<br><br>
<strong>Preparing data for the models</strong>
<br>
I set my data to include only the x_variables previously defined, and my target data to include only my response variable (I also evaluated its log several times).
<br>
I created two different train-test splits so that I could run my models on the full data and also on my x_variables only.
<br><br>
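Both splits use the same test size and the same random state on the same number of rows, so the same observations end up in the test set each time and the scores stay comparable. The sketch below illustrates that idea with the standard library only. The helper `split_indices` is hypothetical and does not reproduce the exact shuffling done by scikit-learn.

```python
import random

def split_indices(n_rows, test_size, seed):
    """Shuffle row positions with a fixed seed and cut off the test share."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_test = round(n_rows * test_size)
    return positions[n_test:], positions[:n_test]

# 196 observations, as reported at the top of this notebook
train_full, test_full = split_indices(196, 0.25, seed = 219)
train_ols, test_ols   = split_indices(196, 0.25, seed = 219)

# checking the sizes and that both splits hold out the same rows
print(len(train_full), len(test_full))
print(test_full == test_ols)
```

With a different seed for each split, the two test sets would contain different rows and the model comparison would mix the effect of the features with the effect of the split.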

In [None]:
# applying the model in scikit-learn

# preparing x-variables from the OLS model, this is also used in other models to test with reduced features
ols_data = birthweight.loc[ : , x_variables]


# preparing response variable
birthweight_target = birthweight.loc[ : , "bwght"]


###############################################
## setting up more than one train-test split ##
###############################################
# FULL X-dataset (normal Y)
x_train_FULL, x_test_FULL, y_train_FULL, y_test_FULL = train_test_split(
            birthweight_data,     # x-variables
            birthweight_target,   # y-variable
            test_size = 0.25,
            random_state = 219)


# OLS p-value x-dataset (normal Y)
x_train_OLS, x_test_OLS, y_train_OLS, y_test_OLS = train_test_split(
            ols_data,         # x-variables
            birthweight_target,   # y-variable
            test_size = 0.25,
            random_state = 219)

<br><br>
<strong>OLS model</strong>
<br>
Running the OLS with all the features.
<br>
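The training and testing scores in this section come from `.score()`, which for scikit-learn regressors returns R-square. As a rough sketch of what that number means, the function below computes R-square by hand. The function name `r_square` and the values are made up for illustration and are not results from my data.

```python
def r_square(actual, predicted):
    """R-square = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# made-up response values and predictions, for illustration only
actual    = [3000, 3500, 2500, 4000]
predicted = [3100, 3400, 2700, 3900]

# scoring the made-up predictions
score = r_square(actual, predicted)
print(round(score, 4))
```

A score of 1 means the predictions match the response exactly, and the train-test gap used in this notebook is simply the absolute difference between two such scores.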

In [None]:
from sklearn.linear_model import LinearRegression

# INSTANTIATING a model object
lr = LinearRegression()


# FITTING to the training data, full dataset
lr_fit = lr.fit(x_train_FULL, y_train_FULL)


# PREDICTING on new data, full dataset
lr_pred = lr_fit.predict(x_test_FULL)


# SCORING the results
print('OLS Training Score :', lr.score(x_train_FULL, y_train_FULL).round(4))  # using R-square
print('OLS Testing Score  :',  lr.score(x_test_FULL, y_test_FULL).round(4)) # using R-square

lr_train_score = lr.score(x_train_FULL, y_train_FULL).round(4)
lr_test_score = lr.score(x_test_FULL, y_test_FULL).round(4)


# displaying and saving the gap between training and testing
print('OLS Train-Test Gap :', abs(lr_train_score - lr_test_score).round(4))
lr_test_gap = abs(lr_train_score - lr_test_score).round(4)

<br>
<strong>OLS model</strong>
<br>
Running the OLS using only the x_variables features.
<br>

In [None]:
# INSTANTIATING a model object
lr = LinearRegression()


# FITTING to the training data, only with the x_variables defined
# for simplicity the name OLS is kept for future models
lr_fit = lr.fit(x_train_OLS, y_train_OLS)


# PREDICTING on new data, only with the x_variables defined
# for simplicity the name OLS is kept for future models
lr_pred = lr_fit.predict(x_test_OLS)


# SCORING the results
print('OLS Training Score :', lr.score(x_train_OLS, y_train_OLS).round(4))  # using R-square
print('OLS Testing Score  :',  lr.score(x_test_OLS, y_test_OLS).round(4)) # using R-square

lr_train_score = lr.score(x_train_OLS, y_train_OLS).round(4)
lr_test_score = lr.score(x_test_OLS, y_test_OLS).round(4)


# displaying and saving the gap between training and testing
print('OLS Train-Test Gap :', abs(lr_train_score - lr_test_score).round(4))
lr_test_gap = abs(lr_train_score - lr_test_score).round(4)

In [None]:
# zipping each feature name to its coefficient
lr_model_values = zip(birthweight_data[x_variables].columns,
                      lr_fit.coef_.round(decimals = 2))


# setting up a placeholder list to store model features
lr_model_lst = [('intercept', lr_fit.intercept_.round(decimals = 2))]


# appending each feature-coefficient pair to the list
for val in lr_model_values:
    lr_model_lst.append(val)
    

# checking the results
for pair in lr_model_lst:
    print(pair)

<br>
<strong>Lasso Regression</strong>
<br>
Running the Lasso Regression with all the features.
<br>
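When the Lasso is run on all the features, the coefficients it shrinks to zero point to variables that could be removed from the x_variables. The sketch below shows one way to separate the two groups. The pairs are hypothetical and are not coefficients from my models. It builds new lists instead of removing items inside a loop, because removing from a list while iterating over it can skip the next item.

```python
# hypothetical (feature, coefficient) pairs, not results from this notebook
model_lst = [('intercept', 3500.0),
             ('cigs',      -35.2),
             ('fage',      0.0),
             ('male',      -0.0),
             ('drink',     -110.4)]

# keeping the pairs with a non-zero coefficient
kept = [(feature, coef) for feature, coef in model_lst if coef != 0]

# names of the features whose coefficient was shrunk to zero
dropped = [feature for feature, coef in model_lst if coef == 0]

# checking the results
print(kept)
print(dropped)
```

Here the two zero coefficients sit next to each other on purpose: an in-place removal loop would delete the first one and then step over the second.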

In [None]:
from sklearn.linear_model import Lasso

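# INSTANTIATING a model object
# note: the normalize argument was removed in scikit-learn 1.2; on newer
# versions, drop it and scale the features beforehand (e.g. StandardScaler)
lasso_model = Lasso(alpha     = 1,
                    normalize = True)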


# FITTING to training data, full dataset
lasso_fit = lasso_model.fit(x_train_FULL, y_train_FULL)


# PREDICTING on new data, full dataset
lasso_pred = lasso_fit.predict(x_test_FULL)


# SCORING the results
print('Lasso Training Score :', lasso_model.score(x_train_FULL, y_train_FULL).round(4))
print('Lasso Testing Score  :', lasso_model.score(x_test_FULL, y_test_FULL).round(4))


# saving scoring data for future use
lasso_train_score = lasso_model.score(x_train_FULL, y_train_FULL).round(4) # using R-square
lasso_test_score  = lasso_model.score(x_test_FULL, y_test_FULL).round(4)   # using R-square


# displaying and saving the gap between training and testing
print('Lasso Train-Test Gap :', abs(lasso_train_score - lasso_test_score).round(4))
lasso_test_gap = abs(lasso_train_score - lasso_test_score).round(4)

<br>
<strong>Lasso Regression</strong>
<br>
Running the Lasso Regression with the x_variables features.
<br>

In [None]:
from sklearn.linear_model import Lasso

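# INSTANTIATING a model object
# note: the normalize argument was removed in scikit-learn 1.2; on newer
# versions, drop it and scale the features beforehand (e.g. StandardScaler)
lasso_model = Lasso(alpha     = 1,
                    normalize = True)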


# FITTING to the training data, only with the x_variables defined
# for simplicity the name OLS was kept
lasso_fit = lasso_model.fit(x_train_OLS, y_train_OLS)


# PREDICTING on new data, only with the x_variables defined
# for simplicity the name OLS was kept
lasso_pred = lasso_fit.predict(x_test_OLS)


# SCORING the results
print('Lasso Training Score :', lasso_model.score(x_train_OLS, y_train_OLS).round(4))
print('Lasso Testing Score  :', lasso_model.score(x_test_OLS, y_test_OLS).round(4))


# saving scoring data for future use
lasso_train_score = lasso_model.score(x_train_OLS, y_train_OLS).round(4) # using R-square
lasso_test_score  = lasso_model.score(x_test_OLS, y_test_OLS).round(4)   # using R-square


# displaying and saving the gap between training and testing
print('Lasso Train-Test Gap :', abs(lasso_train_score - lasso_test_score).round(4))
lasso_test_gap = abs(lasso_train_score - lasso_test_score).round(4)

In [None]:
# zipping each feature name to its coefficient
# this part has to be modified to include only the [x_variables] when running the OLS split,
# or to not include [x_variables] when running the full split
lasso_model_values = zip(birthweight_data[x_variables].columns, lasso_fit.coef_.round(decimals = 2))


# setting up a placeholder list to store model features
lasso_model_lst = [('intercept', lasso_fit.intercept_.round(decimals = 2))]


# appending each feature-coefficient pair to the list
for val in lasso_model_values:
    lasso_model_lst.append(val)
    

# checking the results
for pair in lasso_model_lst:
    print(pair)

In [None]:
# dropping coefficients that are equal to zero
# this is run when the full dataset is used in the Lasso Regression to 
# determine the variables that could be removed from the x_variables

# keeping only the feature-coefficient pairs with a non-zero coefficient
# (a new list is built because removing items while looping can skip pairs)
lasso_model_lst = [(feature, coefficient)
                   for feature, coefficient in lasso_model_lst
                   if coefficient != 0]

            
# checking the results
for pair in lasso_model_lst:
    print(pair)

<br>
<strong>ARD Regression</strong>
<br>
Running the ARD Regression with all the features.
<br>

In [None]:
from sklearn.linear_model import ARDRegression
# INSTANTIATING a model object
ard_model = ARDRegression()


# FITTING the training data, full dataset
ard_fit = ard_model.fit(x_train_FULL, y_train_FULL)


# PREDICTING on new data, full dataset
ard_pred = ard_fit.predict(x_test_FULL)


print('Training Score:', ard_model.score(x_train_FULL, y_train_FULL).round(4))
print('Testing Score :', ard_model.score(x_test_FULL, y_test_FULL).round(4))


# saving scoring data for future use
ard_train_score = ard_model.score(x_train_FULL, y_train_FULL).round(4)
ard_test_score  = ard_model.score(x_test_FULL, y_test_FULL).round(4)


# displaying and saving the gap between training and testing
print('ARD Train-Test Gap :', abs(ard_train_score - ard_test_score).round(4))
ard_test_gap = abs(ard_train_score - ard_test_score).round(4)

In [None]:
from sklearn.linear_model import ARDRegression
# INSTANTIATING a model object
ard_model = ARDRegression()


# FITTING the training data, only with the x_variables defined
# for simplicity the name OLS was kept
ard_fit = ard_model.fit(x_train_OLS, y_train_OLS)


# PREDICTING on new data, only with the x_variables defined
# for simplicity the name OLS was kept
ard_pred = ard_fit.predict(x_test_OLS)


print('Training Score:', ard_model.score(x_train_OLS, y_train_OLS).round(4))
print('Testing Score :', ard_model.score(x_test_OLS, y_test_OLS).round(4))


# saving scoring data for future use
ard_train_score = ard_model.score(x_train_OLS, y_train_OLS).round(4)
ard_test_score  = ard_model.score(x_test_OLS, y_test_OLS).round(4)


# displaying and saving the gap between training and testing
print('ARD Train-Test Gap :', abs(ard_train_score - ard_test_score).round(4))
ard_test_gap = abs(ard_train_score - ard_test_score).round(4)

In [None]:
# zipping each feature name to its coefficient
# this part has to be modified to include only the [x_variables] when running the OLS split,
# or to not include [x_variables] when running the full split
ard_model_values = zip(birthweight_data[x_variables].columns, ard_fit.coef_.round(decimals = 5))


# setting up a placeholder list to store model features
ard_model_lst = [('intercept', ard_fit.intercept_.round(decimals = 2))]


# appending each feature-coefficient pair to the list
for val in ard_model_values:
    ard_model_lst.append(val)
    

# checking the results
for pair in ard_model_lst:
    print(pair)

In [None]:
# dropping coefficients that are equal to zero
# this is run when the full dataset is used in the ARD Regression to 
# determine the variables that could be removed from the x_variables

# keeping only the feature-coefficient pairs with a non-zero coefficient
# (a new list is built because removing items while looping can skip pairs)
ard_model_lst = [(feature, coefficient)
                 for feature, coefficient in ard_model_lst
                 if coefficient != 0]

            
# checking the results
for pair in ard_model_lst:
    print(pair)

<br>
<strong>Results of OLS, Lasso and ARD regressions</strong>
<br>
I compared each model's results, considering the train score, the test score and the gap between them.
<br>
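As a small sketch of this comparison, the code below ranks three models by test score and then by the smaller train-test gap. The scores are placeholders for illustration only, not the results of this notebook, and the ranking rule is an assumption about how such a table could be read.

```python
# placeholder scores for illustration only, not the results of this notebook
results = [
    {'model': 'OLS',   'train': 0.72, 'test': 0.65},
    {'model': 'Lasso', 'train': 0.71, 'test': 0.63},
    {'model': 'ARD',   'train': 0.70, 'test': 0.66},
]

# computing the gap between training and testing for each model
for row in results:
    row['gap'] = round(abs(row['train'] - row['test']), 4)

# ranking by the higher test score first, then by the smaller gap
ranked = sorted(results, key = lambda row: (-row['test'], row['gap']))

# checking the results
for row in ranked:
    print(f"{row['model']:<8}{row['train']:<10}{row['test']:<10}{row['gap']}")
```

A model with a slightly lower test score but a much smaller gap can still be the better choice, so the ranking is only a starting point for the decision.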

In [None]:
# comparing results

print(f"""
Model      Train Score      Test Score      Train-Test Gap
-----      -----------      ----------      --------------
OLS        {lr_train_score}           {lr_test_score}           {lr_test_gap}
Lasso      {lasso_train_score}           {lasso_test_score}            {lasso_test_gap}
ARD        {ard_train_score}           {ard_test_score}           {ard_test_gap}
""")


# creating a dictionary for model results
model_performance = {
    
    'Model Type'    : ['OLS', 'Lasso', 'ARD'],
           
    'Training' : [lr_train_score, lasso_train_score,
                                   ard_train_score],
           
    'Testing'  : [lr_test_score, lasso_test_score,
                                   ard_test_score],
                    
    'Train-Test Gap' : [lr_test_gap, lasso_test_gap,
                                        ard_test_gap],
                    
    'Model Size' : [len(lr_model_lst), len(lasso_model_lst),
                                    len(ard_model_lst)],
                    
    'Model' : [lr_model_lst, lasso_model_lst, ard_model_lst]}


# converting model_performance into a DataFrame
model_performance = pd.DataFrame(model_performance)


# # sending model results to Excel
# model_performance.to_excel('./Model_results/linear_model_performance.xlsx',
#                            index = False)

<strong>Model predictions</strong>
<br>
I decided to extract my model predictions and save them in an Excel file. Then I used this file to try to identify even more improvements for my model by looking at the highest and lowest residuals from my different models.
<br>
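The review of the highest and lowest residuals can also be sketched in code instead of in Excel. The values below are made up for illustration. The deviation is the prediction minus the original value, which is the sign convention used in this notebook.

```python
# made-up values for illustration only
actual    = [3200, 2900, 4100, 1500, 3600]
predicted = [3150.5, 3300.0, 3900.25, 2600.0, 3580.0]

# deviation = prediction - actual
deviations = [round(p - a, 2) for a, p in zip(actual, predicted)]

# positions of the observations ordered by absolute deviation, largest first
order = sorted(range(len(deviations)),
               key     = lambda i: abs(deviations[i]),
               reverse = True)

# checking the two observations the model misses the most
for i in order[:2]:
    print(i, actual[i], predicted[i], deviations[i])
```

Observations that appear at the top of this ordering for every model are good candidates for a closer look at their features.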

In [None]:
prediction_results = pd.DataFrame(data = {
    'Original Birthweight' : y_test_FULL,
    'LR Predictions'       : lr_pred.round(decimals = 2),
    'Lasso Predictions'    : lasso_pred.round(decimals = 2),
    'ARD Predictions'      : ard_pred.round(decimals = 2),
    'LR Deviation'         : lr_pred.round(decimals = 2) - y_test_FULL,
    'Lasso Deviation'      : lasso_pred.round(decimals = 2) - y_test_FULL,
    'ARD Deviation'        : ard_pred.round(decimals = 2) - y_test_FULL,
    })


# prediction_results.to_excel(excel_writer = './Model_results/linear_model_predictions.xlsx',
#                             index = False)

<br><br>
<strong>KNN Regression</strong>
<br>
I developed a KNN Regression model first without scaling my data and then with scaled data. I compared the results from these two options and also tuned them by finding the optimal number of neighbors to use.
<br>
Since none of the outputs from the KNN regression gave better scores or a smaller gap between test and train, I decided not to include these models in my previous comparisons.
<br>
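The two ideas in this section, scaling and choosing the number of neighbors, can be sketched with the standard library. The values are made up for illustration, and the helper `standardize` is hypothetical. Like StandardScaler, it divides by the population standard deviation. Since KNN relies on distances, a feature with a wide range would otherwise dominate a feature with a narrow one.

```python
from statistics import mean, pstdev

def standardize(values):
    """Center on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# made-up values for two features on very different scales
cigs  = [0, 10, 20, 30]
meduc = [12, 14, 16, 18]

# after scaling, both features have mean 0 and variance 1
cigs_scaled  = standardize(cigs)
meduc_scaled = standardize(meduc)

# hypothetical test scores for 1 to 4 neighbors
test_scores = [0.10, 0.25, 0.31, 0.28]

# the list index starts at 0, so 1 is added to get the number of neighbors
opt_neighbors = test_scores.index(max(test_scores)) + 1
print(opt_neighbors)
```

The `+ 1` matters because the first score in the list belongs to one neighbor, not zero.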

In [None]:
from sklearn.neighbors import KNeighborsRegressor # KNN for Regression
from sklearn.preprocessing import StandardScaler

# INSTANTIATING StandardScaler() object
scaler = StandardScaler()


# FITTING scaler with data, using only my x_variables stored in ols_data
scaler.fit(ols_data)


# TRANSFORMING data after fit into scaled
x_scaled = scaler.transform(ols_data)


# converting scaled data into a DataFrame
x_scaled_df = pd.DataFrame(x_scaled)


# checking the results
x_scaled_df.describe().round(2)

In [None]:
# adding labels to scaled DataFrame
x_scaled_df.columns = ols_data.columns

#  comparing pre- and post-scaling of data
print(f"""
Dataset BEFORE Scaling
----------------------
{np.var(ols_data)}


Dataset AFTER Scaling
----------------------
{np.var(x_scaled_df)}
""")

<br>
<strong>k-Nearest Neighbors with Non-Standardized Data</strong>

In [None]:
# splitting the non-scaled data, using only the x_variables stored in ols_data
x_train, x_test, y_train, y_test = train_test_split(
            ols_data,
            birthweight_target,
            test_size = 0.25,
            random_state = 219)

In [None]:
from sklearn.neighbors import KNeighborsRegressor

# INSTANTIATING KNN model object
knn_reg = KNeighborsRegressor(algorithm = 'auto',
                              n_neighbors = 7)


# FITTING to training data
knn_fit = knn_reg.fit(x_train, y_train)


# PREDICTING on new data
knn_reg_pred = knn_fit.predict(x_test)


# SCORING results
print('KNN Training Score:', knn_reg.score(x_train, y_train).round(4))
print('KNN Testing Score :',  knn_reg.score(x_test, y_test).round(4))


# saving scoring data for future use
knn_reg_score_train = knn_reg.score(x_train, y_train).round(4)
knn_reg_score_test  = knn_reg.score(x_test, y_test).round(4)


# displaying and saving the gap between training and testing
print('KNN Train-Test Gap:', abs(knn_reg_score_train - knn_reg_score_test).round(4))
knn_reg_test_gap = abs(knn_reg_score_train - knn_reg_score_test).round(4)

In [None]:
# creating lists for training set accuracy and test set accuracy
training_accuracy = []
test_accuracy     = []


# building a visualization of 1 to 50 neighbors
neighbors_settings = range(1, 51)


for n_neighbors in neighbors_settings:
    # Building the model
    clf = KNeighborsRegressor(n_neighbors = n_neighbors)
    clf.fit(x_train, y_train)
    
    # Recording the training set score (R-square, since this is a regressor)
    training_accuracy.append(clf.score(x_train, y_train))
    
    # Recording the testing set score (R-square)
    test_accuracy.append(clf.score(x_test, y_test))


# plotting the visualization
fig, ax = plt.subplots(figsize=(12,8))
plt.plot(neighbors_settings, training_accuracy, label = "training R-square")
plt.plot(neighbors_settings, test_accuracy, label = "test R-square")
plt.ylabel("R-square")
plt.xlabel("n_neighbors")
plt.legend()
plt.show()

In [None]:
# finding the optimal number of neighbors
opt_neighbors = test_accuracy.index(max(test_accuracy)) + 1
print(f"""The optimal number of neighbors is {opt_neighbors}""")

In [None]:
# INSTANTIATING a model with the optimal number of neighbors
knn_opt = KNeighborsRegressor(algorithm   = 'auto',
                              n_neighbors = opt_neighbors)



# FITTING model based on the training data
knn_opt_fit = knn_opt.fit(x_train, y_train)



# PREDICTING on new data
knn_opt_pred = knn_opt_fit.predict(x_test)



# SCORING results
print('KNN Training Score:', knn_opt.score(x_train, y_train).round(4))
print('KNN Testing Score :',  knn_opt.score(x_test, y_test).round(4))


# saving scoring data for future use
knn_opt_score_train = knn_opt.score(x_train, y_train).round(4)
knn_opt_score_test  = knn_opt.score(x_test, y_test).round(4)


# displaying and saving the gap between training and testing
print('KNN Train-Test Gap:', abs(knn_opt_score_train - knn_opt_score_test).round(4))
knn_opt_test_gap = abs(knn_opt_score_train - knn_opt_score_test).round(4)

<br>
<strong>Splitting the scaled data into my test and train sets</strong>
<br>

In [None]:
x_train_STAND, x_test_STAND, y_train_STAND, y_test_STAND = train_test_split(
            x_scaled_df,
            birthweight_target,
            test_size = 0.25,
            random_state = 219)

In [None]:
# creating lists for training set accuracy and test set accuracy
training_accuracy = []
test_accuracy = []


# building a visualization of 1 to 50 neighbors
neighbors_settings = range(1, 51)


for n_neighbors in neighbors_settings:
    # Building the model
    clf = KNeighborsRegressor(n_neighbors = n_neighbors)
    clf.fit(x_train_STAND, y_train_STAND)
    
    # Recording the training set score (R-square, since this is a regressor)
    training_accuracy.append(clf.score(x_train_STAND, y_train_STAND))
    
    # Recording the testing set score (R-square)
    test_accuracy.append(clf.score(x_test_STAND, y_test_STAND))


# plotting the visualization
fig, ax = plt.subplots(figsize=(12,8))
plt.plot(neighbors_settings, training_accuracy, label = "training R-square")
plt.plot(neighbors_settings, test_accuracy,     label = "test R-square")
plt.ylabel("R-square")
plt.xlabel("n_neighbors")
plt.legend()
plt.show()


# finding the optimal number of neighbors
opt_neighbors = test_accuracy.index(max(test_accuracy)) + 1
print(f"""The optimal number of neighbors is {opt_neighbors}""")

In [None]:
# INSTANTIATING a model with the optimal number of neighbors
knn_stand = KNeighborsRegressor(algorithm = 'auto',
                                n_neighbors = opt_neighbors)



# FITTING model based on the training data
knn_stand_fit = knn_stand.fit(x_train_STAND, y_train_STAND)



# PREDICTING on new data
knn_stand_pred = knn_stand_fit.predict(x_test_STAND)



# SCORING the results
print('KNN Training Score:', knn_stand.score(x_train_STAND, y_train_STAND).round(4))
print('KNN Testing Score :',  knn_stand.score(x_test_STAND, y_test_STAND).round(4))


# saving scoring data for future use
knn_stand_score_train = knn_stand.score(x_train_STAND, y_train_STAND).round(4)
knn_stand_score_test  = knn_stand.score(x_test_STAND, y_test_STAND).round(4)


# displaying and saving the gap between training and testing
print('KNN Train-Test Gap:', abs(knn_stand_score_train - knn_stand_score_test).round(4))
knn_stand_test_gap = abs(knn_stand_score_train - knn_stand_score_test).round(4)

<br>
<strong>Model results for k-Nearest Neighbors</strong>

In [None]:
# comparing results

print(f"""
KNN Model             Neighbors     Train Score      Test Score
----------------      ---------     -----------      ----------
Non-Standardized      {knn_reg.n_neighbors:<14}{knn_reg_score_train}           {knn_reg_score_test}
Non-Standardized      {knn_opt.n_neighbors:<14}{knn_opt_score_train}           {knn_opt_score_test}
Standardized          {knn_stand.n_neighbors:<14}{knn_stand_score_train}           {knn_stand_score_test}
""")


# creating a dictionary for model results
model_performance = {
    
    'Model Type'    : ['KNN_Not_Standardized', 'KNN_Not_Standardized_Opt', 'KNN_Standardized_Opt'],
           
    
    'Training' : [knn_reg_score_train,
                  knn_opt_score_train,
                  knn_stand_score_train],
           
    
    'Testing'  : [knn_reg_score_test,
                  knn_opt_score_test,
                  knn_stand_score_test],
                    
    
    'Train-Test Gap' : [knn_reg_test_gap,
                        knn_opt_test_gap,
                        knn_stand_test_gap],
                   
    
    'Model Size' : ["NA", "NA", "NA"],
                    
    'Model'      : ["NA", "NA", "NA"] }

<br>
<strong>Final model output and selection</strong>
<br>
The following table summarizes my models and marks the model selected to predict birth weight.
<br>
As mentioned before, the KNN model is not included because it produced low scores and large gaps between the test and train scores.
<br>

In [None]:
# final results

print(f"""
Model      Train Score      Test Score      Train-Test Gap
-----      -----------      ----------      --------------
OLS**      {lr_train_score}           {lr_test_score}           {lr_test_gap}
Lasso      {lasso_train_score}           {lasso_test_score}            {lasso_test_gap}
ARD        {ard_train_score}           {ard_test_score}           {ard_test_gap}

Note: My final model selection is marked by using double **""")