# **Regression Analysis**


The goal of regression analysis is to describe the relationship between one set of variables, called the dependent variables, and another set of variables, called independent or explanatory variables. When there is only one explanatory variable, the analysis is called simple regression.
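
As a rough illustration of simple regression, the sketch below fits the model y = β0 + β1·x + ε by least squares using the closed-form estimates (slope = sample covariance of x and y divided by sample variance of x). The data are synthetic and the variable names are made up for this example; they are not part of the teaching ratings data set.

```python
import numpy as np

# Synthetic data (hypothetical, for illustration only): y is linear in x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=200)

# Closed-form least-squares estimates for y = beta_0 + beta_1 * x
beta_1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta_0 = y.mean() - beta_1 * x.mean()

print(f"intercept = {beta_0:.3f}, slope = {beta_1:.3f}")
```

The estimates should land near the values used to generate the data (2.0 and 0.5), up to sampling noise.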


In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

In [2]:
ratings_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork/labs/teachingratings.csv'
ratings_df = pd.read_csv(ratings_url)

### Regression with t-test: Using the teachers' rating data set, does gender affect teaching evaluation scores?



-   $H_0: \beta_1 = 0$ (Gender has no effect on teaching evaluation scores)
-   $H_1: \beta_1 \neq 0$ (Gender has an effect on teaching evaluation scores)


We will use the female variable, which is coded 1 for female and 0 for male.


In [3]:
## X is the input variables (or independent variables)
X = ratings_df['female']
## y is the target/dependent variable
y = ratings_df['eval']
## add an intercept (beta_0) to our model
X = sm.add_constant(X) 

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()

Dep. Variable:,eval,R-squared:,0.022
Model:,OLS,Adj. R-squared:,0.02
Method:,Least Squares,F-statistic:,10.56
Date:,"Mon, 12 Jul 2021",Prob (F-statistic):,0.00124
Time:,16:07:12,Log-Likelihood:,-378.5
No. Observations:,463,AIC:,761.0
Df Residuals:,461,BIC:,769.3
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,4.0690,0.034,121.288,0.000,4.003,4.135
female,-0.1680,0.052,-3.250,0.001,-0.270,-0.066

Omnibus:,17.625,Durbin-Watson:,1.209
Prob(Omnibus):,0.0,Jarque-Bera (JB):,18.97
Skew:,-0.496,Prob(JB):,7.6e-05
Kurtosis:,2.981,Cond. No.,2.47


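**Conclusion:** The p-value for the female coefficient (0.001) is less than 0.05, so we reject the null hypothesis: there is evidence that mean evaluation scores differ by gender. The coefficient of -0.1680 means that, on average, female instructors score 0.168 points lower than male instructors, whose mean score is the intercept (4.069).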
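
To see why a regression on a 0/1 variable behaves like a t-test, the following sketch may help. It uses made-up scores for two groups (not the ratings data) and checks that the intercept equals the mean of the group coded 0, the slope equals the difference in group means, and the slope's t statistic matches the pooled-variance two-sample t statistic. All names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical scores for two groups (not the teaching ratings data)
group0 = rng.normal(loc=4.1, scale=0.5, size=120)   # indicator = 0
group1 = rng.normal(loc=3.9, scale=0.5, size=100)   # indicator = 1

y = np.concatenate([group0, group1])
d = np.concatenate([np.zeros(group0.size), np.ones(group1.size)])

# OLS of y on a constant and the 0/1 indicator
X = np.column_stack([np.ones_like(d), d])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
df_resid = y.size - 2
sigma2 = resid @ resid / df_resid
se_slope = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_regression = beta[1] / se_slope

# Pooled-variance two-sample t statistic computed directly
pooled_var = ((group0.size - 1) * group0.var(ddof=1)
              + (group1.size - 1) * group1.var(ddof=1)) / df_resid
t_two_sample = (group1.mean() - group0.mean()) / np.sqrt(
    pooled_var * (1 / group0.size + 1 / group1.size))

print("intercept vs. mean of group 0:", beta[0], group0.mean())
print("slope vs. difference in means:", beta[1], group1.mean() - group0.mean())
print("t from regression vs. two-sample t:", t_regression, t_two_sample)
```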


### Regression with ANOVA: Using the teachers' rating data set, does beauty score for instructors differ by age?


State the Hypothesis:

-   $H_0: \mu_1 = \mu_2 = \mu_3$ (the three population means are equal)
-   $H_1:$ At least one of the means differs


In [4]:
ratings_df.loc[(ratings_df['age'] <= 40), 'age_group'] = '40 years and younger'
ratings_df.loc[(ratings_df['age'] > 40)&(ratings_df['age'] < 57), 'age_group'] = 'between 40 and 57 years'
ratings_df.loc[(ratings_df['age'] >= 57), 'age_group'] = '57 years and older'

Use the ols function from the statsmodels library.


In [5]:
from statsmodels.formula.api import ols
lm = ols('beauty ~ age_group', data=ratings_df).fit()
table = sm.stats.anova_lm(lm)
print(table)

              df      sum_sq    mean_sq          F        PR(>F)
age_group    2.0   20.422744  10.211372  17.597559  4.322549e-08
Residual   460.0  266.925153   0.580272        NaN           NaN


**Conclusion:** Since the p-value (4.32e-08) is less than 0.05, we reject the null hypothesis: there is significant evidence that at least one of the group means differs.
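
In the ANOVA table above, each mean square is the sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares (10.211372 / 0.580272 ≈ 17.6). The sketch below reproduces that arithmetic by hand on three tiny made-up groups; the numbers are hypothetical and chosen only to keep the calculation easy to follow.

```python
import numpy as np

# Three small hypothetical groups (not the ratings data)
groups = [np.array([1.0, 2.0, 3.0]),
          np.array([2.0, 3.0, 4.0]),
          np.array([5.0, 6.0, 7.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = all_values.size - len(groups)

# F = mean square between / mean square within
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")  # prints F(2, 6) = 13.00
```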


### Regression with ANOVA option 2


Create dummy variables. A dummy variable is a numeric variable that represents categorical data, such as gender or race. Dummy variables are dichotomous, i.e., they take only two values, 0 and 1.


In [6]:
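## dtype=float keeps the indicator columns numeric (recent pandas versions return booleans by default)
X = pd.get_dummies(ratings_df[['age_group']], dtype=float)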

In [7]:
y = ratings_df['beauty']
## add an intercept (beta_0) to our model
X = sm.add_constant(X) 

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()

Dep. Variable:,beauty,R-squared:,0.071
Model:,OLS,Adj. R-squared:,0.067
Method:,Least Squares,F-statistic:,17.6
Date:,"Mon, 12 Jul 2021",Prob (F-statistic):,4.32e-08
Time:,16:08:26,Log-Likelihood:,-529.47
No. Observations:,463,AIC:,1065.0
Df Residuals:,460,BIC:,1077.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0138,0.028,0.496,0.620,-0.041,0.069
age_group_40 years and younger,0.3224,0.058,5.574,0.000,0.209,0.436
age_group_57 years and older,-0.2596,0.056,-4.621,0.000,-0.370,-0.149
age_group_between 40 and 57 years,-0.0489,0.045,-1.081,0.280,-0.138,0.040

Omnibus:,11.586,Durbin-Watson:,0.434
Prob(Omnibus):,0.003,Jarque-Bera (JB):,12.114
Skew:,0.394,Prob(JB):,0.00234
Kurtosis:,2.913,Cond. No.,6.62e+15
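
The summary above reports a condition number of about 6.6e15, which points to perfect collinearity: the three age-group dummies add up to the constant column (often called the dummy variable trap). The overall F-statistic (17.6, matching the ANOVA table) is still meaningful, but the individual coefficients are not uniquely determined. One common way around this, sketched below on a small made-up frame rather than the ratings data, is to drop one category so that it becomes the reference level. The frame `df` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for ratings_df (values are hypothetical)
df = pd.DataFrame({
    'age_group': ['40 years and younger', 'between 40 and 57 years',
                  '57 years and older', '40 years and younger',
                  'between 40 and 57 years', '57 years and older'],
    'beauty': [0.6, 0.1, -0.4, 0.2, -0.1, -0.2],
})

# All three dummies plus a constant: the columns are linearly dependent
full = pd.get_dummies(df[['age_group']], dtype=float)
full.insert(0, 'const', 1.0)
rank_full = np.linalg.matrix_rank(full.to_numpy())
print(rank_full, 'independent columns out of', full.shape[1])

# Dropping the first category makes it the reference level
reduced = pd.get_dummies(df[['age_group']], drop_first=True, dtype=float)
reduced.insert(0, 'const', 1.0)

coef, *_ = np.linalg.lstsq(reduced.to_numpy(), df['beauty'].to_numpy(), rcond=None)
for name, value in zip(reduced.columns, coef):
    print(f"{name}: {value:.3f}")
```

With the reference level dropped, the constant is the mean of the omitted group and each remaining coefficient is that group's mean minus the reference mean.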


### Correlation: Using the teachers' rating data set, is the teaching evaluation score correlated with the beauty score?


In [10]:
## X is the input variables (or independent variables)
X = ratings_df['beauty']
## y is the target/dependent variable
y = ratings_df['eval']
## add an intercept (beta_0) to our model
X = sm.add_constant(X) 

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()

Dep. Variable:,eval,R-squared:,0.036
Model:,OLS,Adj. R-squared:,0.034
Method:,Least Squares,F-statistic:,17.08
Date:,"Mon, 12 Jul 2021",Prob (F-statistic):,4.25e-05
Time:,16:12:13,Log-Likelihood:,-375.32
No. Observations:,463,AIC:,754.6
Df Residuals:,461,BIC:,762.9
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,3.9983,0.025,157.727,0.000,3.948,4.048
beauty,0.1330,0.032,4.133,0.000,0.070,0.196

Omnibus:,15.399,Durbin-Watson:,1.238
Prob(Omnibus):,0.0,Jarque-Bera (JB):,16.405
Skew:,-0.453,Prob(JB):,0.000274
Kurtosis:,2.831,Cond. No.,1.27


In [13]:
ratings_df.head()

Unnamed: 0,minority,age,gender,credits,beauty,eval,division,native,tenure,students,allstudents,prof,PrimaryLast,vismin,female,single_credit,upper_division,English_speaker,tenured_prof,age_group
0,yes,36,female,more,0.289916,4.3,upper,yes,yes,24,43,1,0,1,1,0,1,1,1,40 years and younger
1,yes,36,female,more,0.289916,3.7,upper,yes,yes,86,125,1,0,1,1,0,1,1,1,40 years and younger
2,yes,36,female,more,0.289916,3.6,upper,yes,yes,76,125,1,0,1,1,0,1,1,1,40 years and younger
3,yes,36,female,more,0.289916,4.4,upper,yes,yes,77,123,1,1,1,1,0,1,1,1,40 years and younger
4,no,59,male,more,-0.737732,4.5,upper,yes,yes,17,20,2,0,0,0,0,1,1,1,57 years and older


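**Conclusion:** The p-value for the beauty coefficient is less than 0.05, so there is evidence of a correlation between beauty and evaluation scores. The positive coefficient (0.1330) and the R-squared of 0.036 indicate that the correlation is positive but weak.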
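
A regression can answer a correlation question because, with a single predictor, R-squared equals the squared Pearson correlation and the slope is the correlation rescaled by the two standard deviations. The sketch below checks both identities on synthetic data; the generated values are hypothetical and only loosely mimic the scale of the variables above.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical predictor and response (not the ratings data)
x = rng.normal(size=300)
y = 4.0 + 0.13 * x + rng.normal(scale=0.5, size=300)

# Pearson correlation computed directly
r = np.corrcoef(x, y)[0, 1]

# Simple regression of y on x and its R-squared
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
r_squared = 1 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# The slope rescaled by the standard deviations recovers r
r_from_slope = slope * x.std(ddof=1) / y.std(ddof=1)

print(f"r = {r:.4f}, r^2 = {r**2:.4f}, R-squared = {r_squared:.4f}")
print(f"r recovered from the slope = {r_from_slope:.4f}")
```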


### Question 1: Using the teachers' rating data set, does tenure affect beauty scores?

-   Use α = 0.05


In [12]:
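## Note: this fits tenured_prof (0/1) as the response and beauty as the predictor.
## In a simple regression the slope's t-test gives the same p-value in either direction,
## so the test still answers whether beauty scores and tenure are related.
y = ratings_df['tenured_prof']
X = ratings_df['beauty']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()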

Dep. Variable:,tenured_prof,R-squared:,0.0
Model:,OLS,Adj. R-squared:,-0.002
Method:,Least Squares,F-statistic:,0.1689
Date:,"Mon, 12 Jul 2021",Prob (F-statistic):,0.681
Time:,16:12:28,Log-Likelihood:,-249.07
No. Observations:,463,AIC:,502.1
Df Residuals:,461,BIC:,510.4
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.7797,0.019,40.400,0.000,0.742,0.818
beauty,-0.0101,0.024,-0.411,0.681,-0.058,0.038

Omnibus:,88.308,Durbin-Watson:,0.327
Prob(Omnibus):,0.0,Jarque-Bera (JB):,141.061
Skew:,-1.349,Prob(JB):,2.34e-31
Kurtosis:,2.822,Cond. No.,1.27


**Conclusion:** The p-value (0.681) is greater than 0.05, so we fail to reject the null hypothesis: there is no evidence that mean beauty scores differ between tenured and untenured instructors.
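
The model above uses tenured_prof as the response and beauty as the predictor, although the question is phrased the other way round. In a simple regression with the default (nonrobust) standard errors, the slope's t statistic is the same whichever variable is treated as the response, so the p-value does not change. The sketch below illustrates this on synthetic data; the helper `slope_t_stat` and the generated arrays are hypothetical names introduced for this example.

```python
import numpy as np

def slope_t_stat(x, y):
    """t statistic for the slope in an OLS fit of y on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1] / se

rng = np.random.default_rng(3)
beauty = rng.normal(size=250)                      # hypothetical continuous score
tenured = (rng.random(250) < 0.75).astype(float)   # hypothetical 0/1 indicator

t_a = slope_t_stat(beauty, tenured)   # tenured regressed on beauty
t_b = slope_t_stat(tenured, beauty)   # beauty regressed on tenured
print(t_a, t_b)
```

The coefficients themselves do differ between the two directions; only the test of "no linear relationship" is shared.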


### Using the teachers' rating data set, does being an English speaker affect the number of students assigned to professors?

-   Use "allstudents"
-   Use α = 0.05 and α = 0.1 


In [14]:
y=ratings_df['allstudents']
X=ratings_df['English_speaker']
X=sm.add_constant(X)
model=sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()


Dep. Variable:,allstudents,R-squared:,0.007
Model:,OLS,Adj. R-squared:,0.005
Method:,Least Squares,F-statistic:,3.476
Date:,"Mon, 12 Jul 2021",Prob (F-statistic):,0.0629
Time:,16:14:51,Log-Likelihood:,-2654.2
No. Observations:,463,AIC:,5312.0
Df Residuals:,461,BIC:,5321.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,29.6071,14.150,2.092,0.037,1.802,57.413
English_speaker,27.2158,14.598,1.864,0.063,-1.471,55.902

Omnibus:,429.792,Durbin-Watson:,0.708
Prob(Omnibus):,0.0,Jarque-Bera (JB):,10527.126
Skew:,4.129,Prob(JB):,0.0
Kurtosis:,24.852,Cond. No.,8.01


**Conclusion:**

-   At α = 0.05, the p-value (0.063) is greater than α, so we fail to reject the null hypothesis: there is no evidence that being a native or non-native English speaker affects the number of students assigned to an instructor.
-   At α = 0.1, the p-value (0.063) is less than α, so we reject the null hypothesis: there is evidence of a difference in the mean number of students assigned to native and non-native English speakers.
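
The decision rule applied at the two significance levels can be written out as a short sketch. The p-value is the one reported for English_speaker in the summary above; the variable names are illustrative.

```python
p_value = 0.063  # P>|t| for English_speaker in the summary above

decisions = {}
for alpha in (0.05, 0.10):
    # Reject H0 only when the p-value falls below the chosen significance level
    decisions[alpha] = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"alpha = {alpha:.2f}: {decisions[alpha]}")
# prints:
# alpha = 0.05: fail to reject H0
# alpha = 0.10: reject H0
```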

### Using the teachers' rating data set, what is the correlation between the number of students who participated in the evaluation survey and evaluation scores?

-   Use "students" variable


In [15]:
X = ratings_df['students']
y = ratings_df['eval']
X = sm.add_constant(X) 

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()

Dep. Variable:,eval,R-squared:,0.001
Model:,OLS,Adj. R-squared:,-0.001
Method:,Least Squares,F-statistic:,0.5806
Date:,"Mon, 12 Jul 2021",Prob (F-statistic):,0.446
Time:,16:16:33,Log-Likelihood:,-383.46
No. Observations:,463,AIC:,770.9
Df Residuals:,461,BIC:,779.2
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,3.9823,0.033,119.689,0.000,3.917,4.048
students,0.0004,0.001,0.762,0.446,-0.001,0.002

Omnibus:,15.259,Durbin-Watson:,1.198
Prob(Omnibus):,0.0,Jarque-Bera (JB):,16.283
Skew:,-0.456,Prob(JB):,0.000291
Kurtosis:,2.888,Cond. No.,74.8


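**Conclusion:** R-squared is 0.001, so the correlation coefficient is approximately √0.001 ≈ 0.03, which is close to 0 (it is positive because the slope is positive). There is a very weak correlation between the number of students who participated in the evaluation survey and evaluation scores, and with a p-value of 0.446 it is not statistically significant at the 0.05 level.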
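
Because the summary rounds R-squared to three decimals, the square root gives only a rough value of r. As a hedged cross-check, r can also be recovered from the slope's t statistic and the residual degrees of freedom via r = t / √(t² + df). The sketch below applies both routes to the values printed in the summary above.

```python
import math

# Values read from the regression summary above
r_squared = 0.001   # reported to three decimals
t_stat = 0.762      # t statistic for the students coefficient
df_resid = 461      # residual degrees of freedom
slope = 0.0004      # sign of the slope gives the sign of r

sign = math.copysign(1.0, slope)
r_from_r2 = sign * math.sqrt(r_squared)
r_from_t = t_stat / math.sqrt(t_stat ** 2 + df_resid)

print(f"r from rounded R-squared: {r_from_r2:.3f}")  # 0.032
print(f"r from t statistic:       {r_from_t:.3f}")   # 0.035
```

Either way the correlation is positive and very close to zero.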