# Lab_3-3: Hypothesis Testing
This assignment requires more individual learning than previous assignments. You are encouraged to check out the [pandas documentation](http://pandas.pydata.org/pandas-docs/stable/) to find functions or methods you might not have used yet, or to ask questions on [Stack Overflow](http://stackoverflow.com/) and tag them as pandas and python related. The discussion forums are also open for interaction with your peers and the course staff.

Definitions:
* A _quarter_ is a specific three-month period: Q1 is January through March, Q2 is April through June, Q3 is July through September, and Q4 is October through December.
* A _recession_ is defined as starting with two consecutive quarters of GDP decline, and ending with two consecutive quarters of GDP growth.
* A _recession bottom_ is the quarter within a recession which had the lowest GDP.
* A _university town_ is a city which has a high percentage of university students compared to the total population of the city.
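
The recession definitions above can be sketched on a toy series. This is only an illustration: the helper `find_recession` and the GDP values are made up here, and the sketch assumes that the recession starts in the first of the two declining quarters and ends in the second of the two growth quarters.

```python
import pandas as pd

def find_recession(gdp: pd.Series):
    """Return (start, bottom, end) labels of the first recession, or None."""
    diff = gdp.diff()
    n = len(gdp)

    # Start: first quarter of the first pair of consecutive declines
    start = None
    for i in range(1, n - 1):
        if diff.iloc[i] < 0 and diff.iloc[i + 1] < 0:
            start = i
            break
    if start is None:
        return None

    # End: second quarter of the first pair of consecutive rises after the start
    end = None
    for j in range(start + 2, n - 1):
        if diff.iloc[j] > 0 and diff.iloc[j + 1] > 0:
            end = j + 1
            break
    if end is None:
        return None

    # Bottom: quarter with the lowest GDP between start and end
    bottom = gdp.iloc[start:end + 1].idxmin()
    return gdp.index[start], bottom, gdp.index[end]

# Invented GDP values, not taken from gdplev.xls
toy = pd.Series(
    [100.0, 102.0, 101.0, 99.0, 98.0, 99.5, 101.0, 103.0],
    index=["2000q1", "2000q2", "2000q3", "2000q4",
           "2001q1", "2001q2", "2001q3", "2001q4"],
)
start, bottom, end = find_recession(toy)
print(start, bottom, end)
```

Stopping at the first match with `break` keeps the result tied to the first recession in the series, which matters if the data contains more than one.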

**Hypothesis**: University towns have their mean housing prices less affected by recessions. Run a t-test to compare the ratio of the mean price of houses in university towns the quarter before the recession starts compared to the recession bottom: `price_ratio=quarter_before_recession/recession_bottom`
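
As a rough sketch of the test itself, the snippet below compares two groups of price ratios with `scipy.stats.ttest_ind`. The ratios are randomly generated placeholders (the means, spreads and group sizes are assumptions chosen for illustration), so the numbers say nothing about the real housing data.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic price ratios (quarter before recession / recession bottom).
# A ratio closer to 1 means prices fell less.
uni_ratio = rng.normal(loc=1.05, scale=0.05, size=200)
other_ratio = rng.normal(loc=1.08, scale=0.05, size=2000)

t_stat, p_value = ttest_ind(uni_ratio, other_ratio)

# Reject the null hypothesis of equal means at the 1% level
different = p_value < 0.01
# The group with the lower mean ratio lost less value
better = ("university town" if uni_ratio.mean() < other_ratio.mean()
          else "non-university town")
print(different, p_value, better)
```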

The following data files are available for this assignment:

* From the [Zillow research data site](http://www.zillow.com/research/data/), there is housing data for the United States. In particular, the datafile for [all homes at a city level](http://files.zillowstatic.com/research/public/City/City_Zhvi_AllHomes.csv), `City_Zhvi_AllHomes.csv`, has median home sale prices at a fine-grained level.
* From the Wikipedia page on college towns, there is a list of [university towns in the United States](https://en.wikipedia.org/wiki/List_of_college_towns#College_towns_in_the_United_States) which has been copied and pasted into the file `university_towns.txt`.
* From the Bureau of Economic Analysis, US Department of Commerce, the [GDP over time](http://www.bea.gov/national/index.htm#gdp) of the United States in current dollars (use the chained value in 2009 dollars), in quarterly intervals, in the file `gdplev.xls`. For this lab, only look at GDP data from the first quarter of 2000 onward.

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
import csv

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Each function in this assignment below is worth 10%, with the exception of `run_ttest()`, which is worth 50%.

In [3]:
# Use this dictionary to map two-letter acronyms to state names
states = {'OH': 'Ohio', 'KY': 'Kentucky', 'AS': 'American Samoa', 'NV': 'Nevada', 'WY': 'Wyoming', 'NA': 'National', 'AL': 'Alabama', 'MD': 'Maryland', 'AK': 'Alaska', 'UT': 'Utah', 'OR': 'Oregon', 'MT': 'Montana', 'IL': 'Illinois', 'TN': 'Tennessee', 'DC': 'District of Columbia', 'VT': 'Vermont', 'ID': 'Idaho', 'AR': 'Arkansas', 'ME': 'Maine', 'WA': 'Washington', 'HI': 'Hawaii', 'WI': 'Wisconsin', 'MI': 'Michigan', 'IN': 'Indiana', 'NJ': 'New Jersey', 'AZ': 'Arizona', 'GU': 'Guam', 'MS': 'Mississippi', 'PR': 'Puerto Rico', 'NC': 'North Carolina', 'TX': 'Texas', 'SD': 'South Dakota', 'MP': 'Northern Mariana Islands', 'IA': 'Iowa', 'MO': 'Missouri', 'CT': 'Connecticut', 'WV': 'West Virginia', 'SC': 'South Carolina', 'LA': 'Louisiana', 'KS': 'Kansas', 'NY': 'New York', 'NE': 'Nebraska', 'OK': 'Oklahoma', 'FL': 'Florida', 'CA': 'California', 'CO': 'Colorado', 'PA': 'Pennsylvania', 'DE': 'Delaware', 'NM': 'New Mexico', 'RI': 'Rhode Island', 'MN': 'Minnesota', 'VI': 'Virgin Islands', 'NH': 'New Hampshire', 'MA': 'Massachusetts', 'GA': 'Georgia', 'ND': 'North Dakota', 'VA': 'Virginia'}
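
A small sketch of how such a lookup can be applied to a column. The two-entry `lookup` and the code `"ZZ"` are placeholders invented for the example; the point is only the difference between `Series.replace` and `Series.map` for values missing from the dictionary.

```python
import pandas as pd

# Hypothetical two-entry lookup standing in for the full `states` dictionary
lookup = {"NY": "New York", "CA": "California"}
codes = pd.Series(["NY", "CA", "ZZ"])

replaced = codes.replace(lookup)  # unmatched values are left unchanged
mapped = codes.map(lookup)        # unmatched values become NaN

print(replaced.tolist())
print(mapped.tolist())
```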

## Question 0

Let's get the list of university towns first:

In [4]:
def get_list_of_university_towns():
    '''Returns a DataFrame of towns and the states they are in from the
    university_towns.txt list. The format of the DataFrame should look like:
    DataFrame( [ ["Massachusetts", "Boston"], ["Massachusetts", "Fitchburg"] ],
    columns=["State", "RegionName"]  )

    The following cleaning needs to be done:

    1. For "State", removing characters from "[" to the end.
    2. For "RegionName", when applicable, removing every character from " (" to the end.
    3. Depending on how you read the data, you may need to remove newline character '\n'.

    Note: this implementation returns the rows as a list of [State, RegionName]
    pairs; run_ttest() wraps them in a DataFrame. '''
    DATA_PATH = r'/content/drive/MyDrive/Colab Notebooks/ML Class/Week 8/Hypothesis test lab/university_towns.txt'

    university_towns = []
    state = None
    with open(DATA_PATH) as file:
      for line in file:
        if '[edit]' in line:
          # Lines containing "[edit]" are state headings
          state = line.replace('[edit]', "").strip()
        elif line.strip():
          # Any other non-blank line is a town: keep the text before the first "("
          town = line.split('(')[0].strip()
          university_towns.append([state, town])

    return university_towns

# Testing
uni=get_list_of_university_towns()
print(uni)

[['Alabama', 'Auburn'], ['Alabama', 'Florence'], ['Alabama', 'Jacksonville'], ['Alabama', 'Livingston'], ['Alabama', 'Montevallo'], ['Alabama', 'Troy'], ['Alabama', 'Tuscaloosa'], ['Alabama', 'Tuskegee'], ['Alaska', 'Fairbanks'], ['Arizona', 'Flagstaff'], ['Arizona', 'Tempe'], ['Arizona', 'Tucson'], ['Arkansas', 'Arkadelphia'], ['Arkansas', 'Conway'], ['Arkansas', 'Fayetteville'], ['Arkansas', 'Jonesboro'], ['Arkansas', 'Magnolia'], ['Arkansas', 'Monticello'], ['Arkansas', 'Russellville'], ['Arkansas', 'Searcy'], ['California', 'Angwin'], ['California', 'Arcata'], ['California', 'Berkeley'], ['California', 'Chico'], ['California', 'Claremont'], ['California', 'Cotati'], ['California', 'Davis'], ['California', 'Irvine'], ['California', 'Isla Vista'], ['California', 'University Park, Los Angeles'], ['California', 'Merced'], ['California', 'Orange'], ['California', 'Palo Alto'], ['California', 'Pomona'], ['California', 'Redlands'], ['California', 'Riverside'], ['California', 'Sacramento']
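
The cleaning rules in the docstring can also be expressed with regular expressions and returned directly as a DataFrame. The sketch below works on a short in-memory list rather than the real file; the sample lines and the helper name `parse_university_towns` are assumptions modelled on the format described above.

```python
import re
import pandas as pd

def parse_university_towns(lines):
    """Build a (State, RegionName) DataFrame from raw text lines."""
    rows = []
    state = None
    for raw in lines:
        line = raw.strip()            # rule 3: drop the trailing newline
        if not line:
            continue                  # skip blank lines
        if "[edit]" in line:
            # rule 1: remove everything from "[" to the end
            state = re.sub(r"\[.*$", "", line).strip()
        else:
            # rule 2: remove everything from " (" to the end
            town = re.sub(r"\s*\(.*$", "", line).strip()
            rows.append((state, town))
    return pd.DataFrame(rows, columns=["State", "RegionName"])

# Invented sample lines in the same shape as university_towns.txt
sample = [
    "Alabama[edit]\n",
    "Auburn (Auburn University)[1]\n",
    "Florence (University of North Alabama)\n",
    "\n",
    "Alaska[edit]\n",
    "Fairbanks (University of Alaska Fairbanks)[2]\n",
]
towns = parse_university_towns(sample)
print(towns)
```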

## Question 1

Let's check the year and quarter of the recession start time next:

In [5]:
def get_recession_start():
    '''Returns the year and quarter of the recession start time as a
    string value in a format such as 2005q3, together with the GDP
    DataFrame used to find it.'''
    # Read the quarterly GDP rows (columns E:G of the sheet)
    Data=r"/content/drive/MyDrive/Colab Notebooks/ML Class/Week 8/Hypothesis test lab/gdplev.xls"
    df = pd.read_excel(Data, usecols="E:G", skiprows=220, nrows=66, header=None)
    df.columns = ['Quarter', 'GDPinBillions', 'GDPinBillionsChanged2009Dollars']
    # Quarter-over-quarter change in the GDPinBillions column; the first row
    # has no predecessor, so its NaN is replaced with 0
    df["GDPDiff"]=df["GDPinBillions"].diff()
    df = df.fillna(0)
    print(len(df))
    # Scan for two consecutive quarterly declines. There is no break, so the
    # last matching pair in the data determines the result.
    for i in range(len(df)-2):
       if df["GDPDiff"].iloc[i+1] < 0 and df["GDPDiff"].iloc[i + 2] < 0:
          recession_start=df["Quarter"].iloc[i-1]

    return recession_start,df



rs,df=get_recession_start()
print(df)
print(rs)


66
   Quarter  GDPinBillions  GDPinBillionsChanged2009Dollars  GDPDiff
0   2000q1        10031.0                          12359.1      0.0
1   2000q2        10278.3                          12592.5    247.3
2   2000q3        10357.4                          12607.7     79.1
3   2000q4        10472.3                          12679.3    114.9
4   2001q1        10508.1                          12643.3     35.8
..     ...            ...                              ...      ...
61  2015q2        17998.3                          16374.2    214.7
62  2015q3        18141.9                          16454.9    143.6
63  2015q4        18222.8                          16490.7     80.9
64  2016q1        18281.6                          16525.0     58.8
65  2016q2        18450.1                          16583.1    168.5

[66 rows x 4 columns]
2008q3


## Question 2

Let's also get the year and quarter of the recession end time:

In [7]:
def get_recession_end():
    '''Returns the year and quarter of the recession end time as a
    string value in a format such as 2005q3, together with the GDP
    DataFrame used to find it.'''
    # Read the quarterly GDP rows (columns E:G of the sheet)
    Data=r"/content/drive/MyDrive/Colab Notebooks/ML Class/Week 8/Hypothesis test lab/gdplev.xls"
    df = pd.read_excel(Data, usecols="E:G", skiprows=220, nrows=66, header=None)
    df.columns = ['Quarter', 'GDPinBillions', 'GDPinBillionsChanged2009Dollars']
    df["GDPDiff"]=df["GDPinBillions"].diff()
    df = df.fillna(0)
    print(len(df))
    # Look for two quarters of decline followed immediately by two quarters of
    # growth; the recession ends in the second growth quarter.
    for i in range(len(df)-4):
        if df["GDPDiff"].iloc[i] < 0 and df["GDPDiff"].iloc[i + 1] < 0 and  df["GDPDiff"].iloc[i + 2] > 0 and df["GDPDiff"].iloc[i + 3] > 0:
          recession_end=df["Quarter"].iloc[i+3]

    return recession_end,df

re,df=get_recession_end()
print(re)

66
2009q4


## Question 3

Then, let's get the year and quarter of the recession bottom time:

In [9]:
def get_recession_bottom():
    '''Returns the year and quarter of the recession bottom time as a
    string value in a format such as 2005q3, together with the GDP
    DataFrame used to find it.'''
    # Read the quarterly GDP rows (columns E:G of the sheet)
    Data=r"/content/drive/MyDrive/Colab Notebooks/ML Class/Week 8/Hypothesis test lab/gdplev.xls"
    df = pd.read_excel(Data, usecols="E:G", skiprows=220, nrows=66, header=None)
    df.columns = ['Quarter', 'GDPinBillions', 'GDPinBillionsChanged2009Dollars']
    df["GDPDiff"]=df["GDPinBillions"].diff()
    df = df.fillna(0)
    print(len(df))
    # Take the second quarter of the last pair of consecutive declines in the data
    for i in range(len(df)-2):
       if df["GDPDiff"].iloc[i+1] < 0 and df["GDPDiff"].iloc[i + 2] < 0:
          recession_bottom=df["Quarter"].iloc[i+2]
    return recession_bottom,df

rb,df=get_recession_bottom()
print(rb)

66
2009q2


In [11]:
print(df)

       RegionID           RegionName State                           Metro  \
0          6181             New York    NY                        New York   
1         12447          Los Angeles    CA  Los Angeles-Long Beach-Anaheim   
2         17426              Chicago    IL                         Chicago   
3         13271         Philadelphia    PA                    Philadelphia   
4         40326              Phoenix    AZ                         Phoenix   
...         ...                  ...   ...                             ...   
10725    398292  Town of Wrightstown    WI                       Green Bay   
10726    398343               Urbana    NY                         Corning   
10727    398496          New Denmark    WI                       Green Bay   
10728    398839               Angels    CA                               0   
10729    399114              Holland    WI                       Sheboygan   

         CountyName  SizeRank         2000q1         2000q2    

## Question 4

And then we can convert the housing data to quarters (as defined above!) and return the mean values:

In [10]:
def convert_housing_data_to_quarters():
    '''Converts the housing data to quarters and returns it as mean
    values in a dataframe. This dataframe should be a dataframe with
    columns for 2000q1 through 2016q3, and should have a multi-index
    in the shape of ["State","RegionName"].

    Note: Quarters are defined in the lab description above and they are
    not arbitrary three month periods: Q1 is January through March, Q2 is
    April through June, Q3 is July through September, Q4 is October through
    December. The monthly columns in the source file are named yyyy-mm.

    The resulting dataframe should have 67 columns, and 10,730 rows.

    This implementation keeps the default integer index and returns two
    DataFrames: the quarterly means on their own, and the same means
    preceded by the six identifying columns.
    '''
    # Read data; missing prices are replaced with 0
    Data=r'/content/drive/MyDrive/Colab Notebooks/ML Class/Week 8/Hypothesis test lab/City_Zhvi_AllHomes.csv'
    df = pd.read_csv(Data)
    df=df.fillna(0)
    quarter_mapping = {
        '01': 'q1', '02': 'q1', '03': 'q1',
        '04': 'q2', '05': 'q2', '06': 'q2',
        '07': 'q3', '08': 'q3', '09': 'q3',
        '10': 'q4', '11': 'q4', '12': 'q4'
    }
    id_columns = ['RegionID', 'RegionName', 'State', 'Metro', 'CountyName', 'SizeRank']

    # Group the monthly columns (named yyyy-mm) by quarter label, e.g. 2000q1
    months_by_quarter = {}
    for column in df.columns[6:]:
      label = column[0:4] + quarter_mapping[column[5:7]]
      months_by_quarter.setdefault(label, []).append(column)

    # Average the months of each quarter from 2000q1 onward
    qm = {}
    for year in range(2000, 2017):
      for quarter in range(1, 5):
          label = f"{year}q{quarter}"
          if label in months_by_quarter:
            qm[label] = df[months_by_quarter[label]].mean(axis=1)

    df2 = pd.DataFrame(qm)
    df = pd.concat([df[id_columns], df2], axis=1)
    return df2,df



df2,df = convert_housing_data_to_quarters()
df.head()

| | RegionID | RegionName | State | Metro | CountyName | SizeRank | 2000q1 | 2000q2 | 2000q3 | 2000q4 | ... | 2014q2 | 2014q3 | 2014q4 | 2015q1 | 2015q2 | 2015q3 | 2015q4 | 2016q1 | 2016q2 | 2016q3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6181 | New York | NY | New York | Queens | 1 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 515466.666667 | 522800.0 | 528066.666667 | 532266.666667 | 540800.0 | 557200.0 | 572833.333333 | 582866.666667 | 591633.333333 | 587200.0 |
| 1 | 12447 | Los Angeles | CA | Los Angeles-Long Beach-Anaheim | Los Angeles | 2 | 207066.666667 | 214466.666667 | 220966.666667 | 226166.666667 | ... | 498033.333333 | 509066.666667 | 518866.666667 | 528800.0 | 538166.666667 | 547266.666667 | 557733.333333 | 566033.333333 | 577466.666667 | 584050.0 |
| 2 | 17426 | Chicago | IL | Chicago | Cook | 3 | 138400.0 | 143633.333333 | 147866.666667 | 152133.333333 | ... | 192633.333333 | 195766.666667 | 201266.666667 | 201066.666667 | 206033.333333 | 208300.0 | 207900.0 | 206066.666667 | 208200.0 | 212000.0 |
| 3 | 13271 | Philadelphia | PA | Philadelphia | Philadelphia | 4 | 53000.0 | 53633.333333 | 54133.333333 | 54700.0 | ... | 113733.333333 | 115300.0 | 115666.666667 | 116200.0 | 117966.666667 | 121233.333333 | 122200.0 | 123433.333333 | 126933.333333 | 128700.0 |
| 4 | 40326 | Phoenix | AZ | Phoenix | Maricopa | 5 | 111833.333333 | 114366.666667 | 116000.0 | 117400.0 | ... | 164266.666667 | 165366.666667 | 168500.0 | 171533.333333 | 174166.666667 | 179066.666667 | 183833.333333 | 187900.0 | 191433.333333 | 195200.0 |


In [None]:
df2.head()

| | 2000q1 | 2000q2 | 2000q3 | 2000q4 | 2001q1 | 2001q2 | 2001q3 | 2001q4 | 2002q1 | 2002q2 | ... | 2014q2 | 2014q3 | 2014q4 | 2015q1 | 2015q2 | 2015q3 | 2015q4 | 2016q1 | 2016q2 | 2016q3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 515466.666667 | 522800.0 | 528066.666667 | 532266.666667 | 540800.0 | 557200.0 | 572833.333333 | 582866.666667 | 591633.333333 | 587200.0 |
| 1 | 207066.666667 | 214466.666667 | 220966.666667 | 226166.666667 | 233000.0 | 239100.0 | 245066.666667 | 253033.333333 | 261966.666667 | 272700.0 | ... | 498033.333333 | 509066.666667 | 518866.666667 | 528800.0 | 538166.666667 | 547266.666667 | 557733.333333 | 566033.333333 | 577466.666667 | 584050.0 |
| 2 | 138400.0 | 143633.333333 | 147866.666667 | 152133.333333 | 156933.333333 | 161800.0 | 166400.0 | 170433.333333 | 175500.0 | 177566.666667 | ... | 192633.333333 | 195766.666667 | 201266.666667 | 201066.666667 | 206033.333333 | 208300.0 | 207900.0 | 206066.666667 | 208200.0 | 212000.0 |
| 3 | 53000.0 | 53633.333333 | 54133.333333 | 54700.0 | 55333.333333 | 55533.333333 | 56266.666667 | 57533.333333 | 59133.333333 | 60733.333333 | ... | 113733.333333 | 115300.0 | 115666.666667 | 116200.0 | 117966.666667 | 121233.333333 | 122200.0 | 123433.333333 | 126933.333333 | 128700.0 |
| 4 | 111833.333333 | 114366.666667 | 116000.0 | 117400.0 | 119600.0 | 121566.666667 | 122700.0 | 124300.0 | 126533.333333 | 128366.666667 | ... | 164266.666667 | 165366.666667 | 168500.0 | 171533.333333 | 174166.666667 | 179066.666667 | 183833.333333 | 187900.0 | 191433.333333 | 195200.0 |
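
An alternative way to build the quarter labels is to let pandas parse the `yyyy-mm` column names as monthly periods and then group the columns by label. The sketch below uses a tiny invented table (the town names and prices are placeholders), so it only illustrates the mechanics.

```python
import pandas as pd

# Invented monthly prices for two placeholder towns
monthly = pd.DataFrame(
    {"2000-01": [100.0, 10.0], "2000-02": [110.0, 20.0], "2000-03": [120.0, 30.0],
     "2000-04": [130.0, 40.0], "2000-05": [140.0, 50.0]},
    index=["Town A", "Town B"],
)

# Parse the column names as monthly periods and derive labels such as 2000q1
periods = pd.PeriodIndex(monthly.columns, freq="M")
labels = [f"{p.year}q{p.quarter}" for p in periods]

# Group the transposed frame by quarter label, average, and transpose back
quarterly = monthly.T.groupby(labels).mean().T
print(quarterly)
```

Grouping on the transposed frame avoids relying on column-wise `groupby`, and a quarter with fewer than three months present is simply averaged over the months that exist.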


In [23]:
!pip install ipdb



## Question 5

Finally, let's run the actual t-test now:

In [36]:
def run_ttest():
    '''First, creates new data showing the decline or growth of housing prices
    between the recession start and the recession bottom. Then, runs a t-test
    comparing the university town values to the non-university town values,
    and returns whether the null hypothesis (that the two groups are the same)
    can be rejected, as well as the p-value of the test.

    Return the tuple (different, p, better) where: different=True if the t-test
    gives p<0.01 (we reject the null hypothesis), or different=False
    otherwise (we cannot reject the null hypothesis). The variable p should
    be equal to the exact p value returned from scipy.stats.ttest_ind(). The
    value for better should be either "university town" or "non-university town"
    depending on which has a lower mean price ratio (which is equivalent to a
    reduced market loss).

    Hypothesis: University towns have their mean housing prices
    less affected by recessions. Run a t-test to compare the ratio of the mean
    price of houses in university towns the quarter before the recession starts
    compared to the recession bottom:

    price_ratio = quarter_before_recession / recession_bottom
    '''

    _, housing_df = convert_housing_data_to_quarters()
    uniTowns = pd.DataFrame(get_list_of_university_towns(), columns=["State", "RegionName"])

    recession_start, _ = get_recession_start()
    recession_bottom, _ = get_recession_bottom()

    # Expand two-letter acronyms to full state names so they match uniTowns
    housing_df['State'] = housing_df['State'].replace(states)

    # Price ratio: recession-start quarter over recession-bottom quarter
    housing_df["price_ratio"] = housing_df[recession_start] / housing_df[recession_bottom]

    # Drop rows where the ratio is undefined
    housing_df.dropna(subset=["price_ratio"], inplace=True)

    # Flag university towns by matching (State, RegionName) pairs
    uni_pairs = set(zip(uniTowns["State"], uniTowns["RegionName"]))
    housing_df["b_uniTown"] = [
        pair in uni_pairs
        for pair in zip(housing_df["State"], housing_df["RegionName"])
    ]

    uniTown_prices = housing_df[housing_df["b_uniTown"]]["price_ratio"]
    nonUniTown_prices = housing_df[~housing_df["b_uniTown"]]["price_ratio"]

    t_value, p = ttest_ind(uniTown_prices, nonUniTown_prices, nan_policy='omit')
    different = p < 0.01

    # The group with the lower mean price ratio lost less value
    if uniTown_prices.mean() < nonUniTown_prices.mean():
      better = "university town"
    else:
      better = "non-university town"

    return different, p, better

print(run_ttest())

66
66
(False, 0.046040975395856894, 'university town')
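
Splitting the housing rows into university and non-university towns can also be done with a left merge and the `indicator` flag instead of comparing rows one by one. The frames below are invented stand-ins (the price ratios are placeholders, not results), and the column name `is_uni_town` is an assumption for the example.

```python
import pandas as pd

# Invented stand-ins for the housing table and the university-town list
housing = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska"],
    "RegionName": ["Auburn", "Mobile", "Fairbanks"],
    "price_ratio": [1.02, 1.10, 1.01],
})
uni = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "RegionName": ["Auburn", "Fairbanks"],
})

# A left merge keeps every housing row; `_merge` is "both" where the
# (State, RegionName) pair also appears in the university-town list
merged = housing.merge(
    uni.drop_duplicates(), on=["State", "RegionName"],
    how="left", indicator=True,
)
merged["is_uni_town"] = merged["_merge"] == "both"

uni_ratios = merged.loc[merged["is_uni_town"], "price_ratio"]
other_ratios = merged.loc[~merged["is_uni_town"], "price_ratio"]
print(merged[["State", "RegionName", "is_uni_town"]])
```

Dropping duplicates from the town list first keeps the merge from repeating housing rows when a town is listed more than once.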
