## Student's t-test on the 1960s US crime rate

In this notebook, we will perform Student's t-test on 1960s US crime rate data.

The idea behind the test is to check which variables have a significant influence on the crime rate. For each attribute, we assume a *null hypothesis* stating that the attribute has no influence (its regression coefficient is zero) and check the *p-value*. If the *p-value* < $5 \%$, we can reject the *null hypothesis*; that is, the attribute does influence the crime rate.
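For reference, the statistic behind this test in a linear regression is usually written as below. The notation is an assumption introduced here, not something taken from the dataset: $\hat{\beta}_j$ is the estimated coefficient of attribute $j$, $\mathrm{SE}(\hat{\beta}_j)$ is its standard error, $n$ is the number of observations and $k$ is the number of attributes.

$$
H_0: \beta_j = 0, \qquad t_j = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)} \sim t_{n-k-1} \quad \text{under } H_0
$$

The *p-value* is the probability, under $H_0$, of observing a $t_j$ at least as extreme as the one computed from the data.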

First things first, let's load our libraries and our dataset.

In [1]:
import numpy as np           # great for general work, such as arrays 
import statsmodels.api as sm # will perform our test
import pandas as pd          # great for working with datasets

data = 'https://stats.idre.ucla.edu/wp-content/uploads/2016/02/crime.txt' # link for our dataset
df = pd.read_csv(data, sep=r'\s+', header=None) # reads the whitespace-delimited text file, which has no header row

# CrimeRat: Crime rate: # of offenses reported to police per million population
# MaleTeen: The number of males of age 14-24 per 1000 population
# South : Indicator variable for Southern states (0 = No, 1 = Yes)
# Educ : Mean # of years of schooling for persons of age 25 or older
# Police60: 1960 per capita expenditure on police by state and local government
# Police59: 1959 per capita expenditure on police by state and local government
# Labor : Labor force participation rate per 1000 civilian urban males age 14-24
# Males : The number of males per 1000 females
# Pop : State population size in hundred thousands
# NonWhite: The number of non-whites per 1000 population
# Unemp1 : Unemployment rate of urban males per 1000 of age 14-24
# Unemp2 : Unemployment rate of urban males per 1000 of age 35-39
# Median : Median value of transferable goods and assets or family income in tens of $
# BelowMed: The number of families per 1000 earning below 1/2 the median income

df.columns = ['CrimeRat', 'MaleTeen', 'South', 'Educ', 'Police60', 'Police59', 'Labor', 'Males', 'Pop', 'NonWhite',
              'Unemp1', 'Unemp2', 'Median', 'BelowMed'] # here we set the header for each column

In [2]:
x = df.iloc[:, 1:14] # iloc selects by integer position (rows, columns); slices exclude the end index
                     # In this case we want columns 1 to 13, i.e. every column except CrimeRat.
y = df.iloc[:, 0]    # for y, we want only the first column

In [3]:
x = sm.add_constant(x) # adds a column of 1s to the dataset so the model can fit an intercept

In [4]:
result = sm.OLS(y, x).fit() # Ordinary Least Squares fit

In [5]:
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:               CrimeRat   R-squared:                       0.769
Model:                            OLS   Adj. R-squared:                  0.678
Method:                 Least Squares   F-statistic:                     8.462
Date:                Thu, 22 Sep 2022   Prob (F-statistic):           3.69e-07
Time:                        16:09:24   Log-Likelihood:                -203.52
No. Observations:                  47   AIC:                             435.0
Df Residuals:                      33   BIC:                             460.9
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -691.8376    155.888     -4.438      0.0
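As a quick sanity check on how to read this table, the `t` column is simply `coef` divided by `std err`, and `Df Residuals` is the number of observations minus the number of fitted parameters. A minimal sketch using only the rounded figures printed above for `const` (the variable names are illustrative):

```python
# Figures copied from the summary output (rounded as printed)
coef = -691.8376     # coefficient of the constant term
std_err = 155.888    # its standard error
t_stat = coef / std_err

n_obs = 47           # No. Observations
n_params = 14        # 13 attributes plus the constant
df_resid = n_obs - n_params

print(round(t_stat, 3), df_resid)  # -4.438 33
```

Both numbers agree with the `t` and `Df Residuals` entries in the summary.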

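However, the *p-value* displayed is a two-tailed *p-value*, so we need to divide it by two to get the one-tailed *p-value* (this is valid when the sign of the estimated coefficient matches the direction of the alternative hypothesis):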

In [6]:
result.pvalues/2

const       0.000048
MaleTeen    0.009653
South       0.290585
Educ        0.004530
Police60    0.069178
Police59    0.282646
Labor       0.395434
Males       0.219029
Pop         0.375981
NonWhite    0.455618
Unemp1      0.088992
Unemp2      0.022035
Median      0.101658
BelowMed    0.000956
dtype: float64
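Halving a two-tailed *p-value* gives the one-tailed value only when the t statistic falls on the side predicted by the alternative hypothesis; otherwise the one-tailed value is one minus that half. A minimal sketch of this rule, where `one_tailed_p` is a hypothetical helper (it is not part of statsmodels) and the inputs are the rounded `const` figures shown in the outputs above:

```python
def one_tailed_p(two_tailed_p, t_stat, alternative):
    """Convert a two-tailed p-value into a one-tailed one.

    alternative: "greater" tests coefficient > 0, "less" tests coefficient < 0.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    half = two_tailed_p / 2
    # Halving is correct only if the statistic lies in the tested direction
    sign_matches = (t_stat > 0) if alternative == "greater" else (t_stat < 0)
    return half if sign_matches else 1 - half

# Rounded figures for the constant: t = -4.438, two-tailed p = 2 * 0.000048
p_two = 2 * 0.000048
p_less = one_tailed_p(p_two, t_stat=-4.438, alternative="less")
p_greater = one_tailed_p(p_two, t_stat=-4.438, alternative="greater")
print(p_less, p_greater)
```

For the negative t statistic of `const`, the "less" alternative recovers the halved value listed above, while the "greater" alternative yields a value close to 1.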

With that, we can reject the *null hypothesis* for MaleTeen, Educ, Unemp2 and BelowMed; that is, the **number of males of age 14-24, schooling, the unemployment rate of urban males of age 35-39 and the number of families earning below half the median income influence the crime rate** in the US in 1960.
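The selection can be reproduced mechanically by comparing each halved *p-value* with the 5% threshold. A small sketch, assuming the values are copied by hand from the output above into a plain dictionary (`ALPHA`, `one_tailed` and `significant` are illustrative names); the constant is skipped because it is not an attribute:

```python
ALPHA = 0.05  # significance level used in this notebook

# One-tailed p-values copied from the output of result.pvalues/2
one_tailed = {
    "const": 0.000048, "MaleTeen": 0.009653, "South": 0.290585,
    "Educ": 0.004530, "Police60": 0.069178, "Police59": 0.282646,
    "Labor": 0.395434, "Males": 0.219029, "Pop": 0.375981,
    "NonWhite": 0.455618, "Unemp1": 0.088992, "Unemp2": 0.022035,
    "Median": 0.101658, "BelowMed": 0.000956,
}

# Keep every attribute (not the intercept) whose p-value is below ALPHA
significant = [name for name, p in one_tailed.items()
               if name != "const" and p < ALPHA]
print(significant)  # ['MaleTeen', 'Educ', 'Unemp2', 'BelowMed']
```

Inside the notebook itself, the same filter could be applied directly to the pandas Series with boolean indexing, for example `halved = result.pvalues / 2` followed by `halved[halved < 0.05]`.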

That is to say, if we were to build a model for the crime rate (say, a multiple linear regression model), these would be the independent variables for the predictor.