# Preparation for exercise

You will use the commands shown here to complete the case study exercise in your hands-on session.

A new library we will use is `matplotlib`, Python's fundamental plotting library. It is flexible and gives the user full control over all elements of a plot. In this session, we will use a subpackage of `matplotlib` called `pyplot`.

We import it just as we would any other package, but this time we give `matplotlib.pyplot` an abbreviated name, `plt`.

`seaborn` builds on `matplotlib` by providing a higher-level interface for statistical graphics. It provides an interface to produce prettier and more complex visualizations with fewer lines of code.

We can use the `statsmodels` library to perform our simple linear regression. We will use the `formula` API from `statsmodels`.

In [None]:
# import all the packages we need in this notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols

## Case 1
- The manager of the factory wants to get a better understanding of overhead costs. 
- Some of these overhead costs are fixed in the sense that they do not vary appreciably with the volume of work being done, whereas others are variable and do vary directly with the volume of work. 
- The fixed overhead costs tend to come from the supervision, depreciation, and miscellaneous categories, whereas the variable overhead costs tend to come from the indirect labor, supplies, payroll taxes, and overtime categories. 
- However, it is not easy to draw a clear line between the fixed and variable overhead components.
- Machine Hours: number of machine hours used during the month
- Production Runs: the number of separate production runs during the month

### Simple Scatterplot

Please note that scatterplots are used when one continuous variable is plotted against another continuous variable.

In [None]:
# create a dataframe named df_listing1 from a raw csv file 

df_listing1 = pd.read_csv('../data/Lecture_2_Overhead Costs.csv')

# By default, plt.plot connects the points with lines.
# Passing the format string 'o' tells plt.plot to draw circle markers instead.

plt.plot(df_listing1['Machine Hours'], df_listing1['Overhead'], 'o')

## Multiple Scatterplots comparison

While we could visualise each relationship with an individual scatterplot, one at a time, `matplotlib` offers a much handier way: subplots. You specify the dimensions of the final figure as a grid and place smaller plots into the cells of that grid. In this way, you can present your results in a single figure instead of several separate ones.

The `add_subplot` method takes three parameters:

1. Number of rows in the figure for subplots

2. Number of columns in the figure for subplots

3. Position of the subplot in that grid, counted from 1, left to right and top to bottom
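As a minimal sketch of how the three parameters interact, the following uses a 2 x 2 grid and fills only three of its four positions. The toy data and the variable names (`top_left`, `top_right`, `bottom_left`) are assumptions made for illustration and are not part of the exercise data.

```python
import matplotlib.pyplot as plt

# Hypothetical toy data, used only to illustrate the grid layout
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

fig = plt.figure()

# 2 rows x 2 columns; positions are numbered 1..4,
# left to right and then top to bottom
top_left = fig.add_subplot(2, 2, 1)
top_right = fig.add_subplot(2, 2, 2)
bottom_left = fig.add_subplot(2, 2, 3)

top_left.plot(x, y, 'o')
top_right.plot(y, x, 'o')
bottom_left.plot(x, x, 'o')

# Position 4 (bottom right) is deliberately left empty
print(len(fig.axes))
```

Only the positions you request are created, so a grid does not have to be full.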

In [None]:
fig = plt.figure()
fig.set_size_inches(11, 20)
fig.set_dpi(100)
axes1 = fig.add_subplot(3,1,1)
axes2 = fig.add_subplot(3,1,2)
axes3 = fig.add_subplot(3,1,3)

axes1.plot(df_listing1['Machine Hours'], df_listing1['Overhead'], 'o')
axes2.plot(df_listing1['Production Runs'], df_listing1['Overhead'], 'o')
axes3.plot(df_listing1['Production Runs'], df_listing1['Machine Hours'], 'o')

axes1.set_title('Scatterplot of Overhead Versus Machine Hours', pad=30, fontsize=20)
axes1.set_xlabel('Machine Hours', fontsize=15)
axes1.set_ylabel('Overhead', fontsize=15)

axes2.set_title('Scatterplot of Overhead Versus Production Runs', pad=30, fontsize=20)
axes2.set_xlabel('Production Runs', fontsize=15)
axes2.set_ylabel('Overhead', fontsize=15)

axes3.set_title('Scatterplot of Machine Hours Versus Production Runs', pad=30, fontsize=20)
axes3.set_xlabel('Production Runs', fontsize=15)
axes3.set_ylabel('Machine Hours', fontsize=15)

plt.subplots_adjust(wspace=0.4, 
                    hspace=0.4)
plt.show()


## Pearson Correlation

To obtain the correlation measures we call `DataFrame.corr()`. Because the dataframe in this example is named `df_listing1`, the command is `df_listing1.corr()`. We pass `method='pearson'` to request the Pearson correlation coefficient, which is also the default.

In [None]:
df_listing1.corr(method='pearson')
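To make the correlation matrix less of a black box, here is a small sketch that computes the Pearson coefficient by hand and compares it with the value `corr()` reports. The dataframe `df_toy` and its values are hypothetical and stand in for the overhead costs data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the overhead costs file
df_toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.0, 4.0, 5.0, 4.0, 5.0],
})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the column means
dx = df_toy['x'] - df_toy['x'].mean()
dy = df_toy['y'] - df_toy['y'].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same value read from the correlation matrix
r_pandas = df_toy.corr(method='pearson').loc['x', 'y']

print(round(r_manual, 4), round(r_pandas, 4))
```

The two numbers should agree, which confirms that each off-diagonal entry of the matrix is the pairwise Pearson coefficient of the corresponding columns.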

## Case 2

Pharmex has collected data from 50 randomly selected metropolitan regions. There are two variables: Pharmex’s promotional expenditures as a percentage of those of the leading competitor (“Promote”) and Pharmex’s sales as a percentage of those of the leading competitor (“Sales”).

Objective:
- Use a scatterplot to examine the relationship between promotional expenditures and sales at Pharmex.
- Find the least squares line for sales as a function of promotional expenses at Pharmex.
- Interpret the model fit of the regression model.

### Scatterplot Regression Lines - Drugstore Data

The `seaborn` library is used to plot the two variables. Unlike the `matplotlib` examples above, here we use the `regplot` function from `seaborn`, which draws a scatterplot and also fits a regression line through it.

In [None]:
df_listing2 = pd.read_csv('../data/Lecture_1_Drugstore Sales.csv')
fig, ax = plt.subplots()
# Draw the scatterplot and the fitted regression line on the axes created above
sns.regplot(x='Promote', y='Sales', data=df_listing2, ax=ax)
plt.show()

## Case 3

Fifth National Bank of Springfield is facing a gender discrimination suit. The charge is that its female employees receive substantially smaller salaries than its male employees. The bank’s employee data are listed in the file Bank Salaries.csv. For each of its 208 employees, the data include the following variables:

- Education: education level, a categorical variable with categories 1 (finished high school), 2 (finished some college courses), 3 (obtained a bachelor’s degree), 4 (took some graduate courses), 5 (obtained a graduate degree)
- Grade: a categorical variable indicating the current job level, the possible levels being 1 through 6 (6 is highest)
- Years1: years of experience with this bank
- Years2: number of years of work experience at another bank prior to working at Fifth National
- Age: employee’s current age
- Gender: a categorical variable with values “Female” and “Male”
- PC Job: a categorical yes/no variable depending on whether the employee’s current job is computer-related
- Salary: current annual salary

Objective:
To analyse whether the bank discriminates against females in terms of salary.

### Simple Linear Regression

To perform simple linear regression, we use the `ols` function, which fits a model by ordinary least squares, one method for estimating the parameters of a linear regression. Recall that the formula for a line is y = mx + b, where y is our response variable, x is our predictor, b is the intercept, and m is the slope. The intercept and the slope are the parameters we are estimating. The example below fits the Case 2 drugstore data (`df_listing2`).

The formula notation has two parts, separated by a tilde, ~. To the left of the tilde is the response variable, and to the right of the tilde is the predictor.

In [None]:
model = smf.ols(formula='Sales ~ Promote', data=df_listing2)
results = model.fit()
print(results.summary())
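The link between the formula notation and the line y = mx + b can be checked on data where the answer is known in advance. This sketch assumes hypothetical data lying exactly on y = 3 + 2x; the names `df_line` and `fit` are illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data lying exactly on the line y = 3 + 2x
df_line = pd.DataFrame({'x': np.arange(10, dtype=float)})
df_line['y'] = 3.0 + 2.0 * df_line['x']

# Response on the left of the tilde, predictor on the right
fit = smf.ols(formula='y ~ x', data=df_line).fit()

# params is a pandas Series indexed by term name;
# the intercept is added to the model automatically
intercept = fit.params['Intercept']
slope = fit.params['x']
print(intercept, slope)
```

In the full `summary()` table, the same two numbers appear in the `coef` column, in the rows labelled `Intercept` and with the predictor's name.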

## Multiple Regression

Fitting a multiple regression model to a data set is very similar to fitting a simple linear regression model. Using the formula interface, we “add” the other covariates to the right-hand side.

In [None]:
df_listing3 = pd.read_csv('../data/Lecture_3_Bank Salaries.csv')
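# dtype=int stores the dummy columns as 0/1 integers
# (recent pandas versions default to True/False)
df_listing3 = pd.get_dummies(df_listing3, columns=['Gender'], dtype=int)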

In [None]:
def format_Salary(Salary):
    # Strip the '$' sign and the thousands separators, then convert to an integer
    return int(Salary.replace('$', '').replace(',', ''))

df_listing3['SalaryInt'] = df_listing3['Salary'].apply(format_Salary)

model = smf.ols(formula='SalaryInt~Years1+Years2+Gender_Female', data=df_listing3)
results = model.fit()
print(results.summary())
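The two preparation steps used above, dummy encoding and salary cleaning, can be seen in isolation on a few made-up rows. This is a sketch under the assumption that salaries are stored as text with a `$` sign and comma separators; `df_toy` and its values are hypothetical, and the vectorised `.str.replace` calls are an alternative to `apply`, not the notebook's own method.

```python
import pandas as pd

# Hypothetical rows shaped like the bank file; the values are made up
df_toy = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Female'],
    'Salary': ['$32,000', '$41,500', '$38,250'],
})

# One 0/1 indicator column per category: Gender_Female and Gender_Male
df_toy = pd.get_dummies(df_toy, columns=['Gender'], dtype=int)

# Strip the currency formatting, then convert to integers
df_toy['SalaryInt'] = (
    df_toy['Salary']
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)
print(df_toy)
```

Only one of the two indicator columns goes into the regression formula. With an intercept in the model, `Gender_Female` and `Gender_Male` always sum to 1, so including both would make the predictors perfectly collinear.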

## Interaction analysis

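The model below adds the interaction term `Years1:Gender_Female`, which lets the slope of salary on years of experience differ between female and male employees. `seaborn` also combines simple statistical fits with plotting on pandas dataframes: we can use the dummy variable `Gender_Female` as the hue to plot the relationship between working experience and salary separately for each group.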

In [None]:
model = smf.ols(formula='SalaryInt~Years1+Gender_Female+Years1:Gender_Female', data=df_listing3)
results = model.fit()
print(results.summary())

# Illustrate the relationship, using the dummy variable as the hue

sns.pairplot(data=df_listing3, vars=['Years1', 'SalaryInt'], kind='reg', hue='Gender_Female')
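To see how the interaction coefficient should be read, the following sketch fits the same kind of formula to hypothetical data in which the two groups have known, different slopes. The names `df_int`, `x`, `g` and `y` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: group 0 follows y = 10 + 1x, group 1 follows y = 8 + 3x
x = np.tile(np.arange(5, dtype=float), 2)
g = np.repeat([0, 1], 5)
df_int = pd.DataFrame({'x': x, 'g': g})
df_int['y'] = np.where(g == 0, 10 + 1 * x, 8 + 3 * x)

fit = smf.ols(formula='y ~ x + g + x:g', data=df_int).fit()

# 'x' is the slope for g == 0; 'x:g' is the extra slope for g == 1
slope_g0 = fit.params['x']
slope_g1 = fit.params['x'] + fit.params['x:g']
print(slope_g0, slope_g1)
```

By the same reading, in the bank model the `Years1` coefficient is the salary gain per year for the baseline group (`Gender_Female` equal to 0), and the `Years1:Gender_Female` coefficient is how much that gain differs for female employees.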