## Challenge_Distribution

In this challenge you work with real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor's office (https://www.openintro.org/stat/data/ames.csv).

**Based on this data, perform the following tasks**:

- Take a random sample of size 50 from <code>price</code> (stored in this file as the <code>SalePrice</code> column). Using this sample, what is your best point estimate of the population mean?


- Since you have access to the population, simulate the sampling distribution for the average home price in Ames by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called <code>sample_means50</code>. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.


- Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called <code>sample_means150</code>. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?


- Of the sampling distributions from tasks 2 and 3, which has the smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or a small spread?


> Import libraries

In [24]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import random
import math

> Get the data and parse it into a DataFrame

In [8]:
estate_data = pd.read_csv('https://www.openintro.org/stat/data/ames.csv')

In [5]:
estate_data.head()

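| | Order | PID | MS.SubClass | MS.Zoning | Lot.Frontage | Lot.Area | Street | Alley | Lot.Shape | Land.Contour | ... | Pool.Area | Pool.QC | Fence | Misc.Feature | Misc.Val | Mo.Sold | Yr.Sold | Sale.Type | Sale.Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | RL | 141.0 | 31770 | Pave | | IR1 | Lvl | ... | 0 | | | | 0 | 5 | 2010 | WD | Normal | 215000 |
| 1 | 2 | 526350040 | 20 | RH | 80.0 | 11622 | Pave | | Reg | Lvl | ... | 0 | | MnPrv | | 0 | 6 | 2010 | WD | Normal | 105000 |
| 2 | 3 | 526351010 | 20 | RL | 81.0 | 14267 | Pave | | IR1 | Lvl | ... | 0 | | | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
| 3 | 4 | 526353030 | 20 | RL | 93.0 | 11160 | Pave | | Reg | Lvl | ... | 0 | | | | 0 | 4 | 2010 | WD | Normal | 244000 |
| 4 | 5 | 527105010 | 60 | RL | 74.0 | 13830 | Pave | | IR1 | Lvl | ... | 0 | | MnPrv | | 0 | 3 | 2010 | WD | Normal | 189900 |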


> Find column with price

In [13]:
estate_data.columns

Index(['Order', 'PID', 'MS.SubClass', 'MS.Zoning', 'Lot.Frontage', 'Lot.Area',
       'Street', 'Alley', 'Lot.Shape', 'Land.Contour', 'Utilities',
       'Lot.Config', 'Land.Slope', 'Neighborhood', 'Condition.1',
       'Condition.2', 'Bldg.Type', 'House.Style', 'Overall.Qual',
       'Overall.Cond', 'Year.Built', 'Year.Remod.Add', 'Roof.Style',
       'Roof.Matl', 'Exterior.1st', 'Exterior.2nd', 'Mas.Vnr.Type',
       'Mas.Vnr.Area', 'Exter.Qual', 'Exter.Cond', 'Foundation', 'Bsmt.Qual',
       'Bsmt.Cond', 'Bsmt.Exposure', 'BsmtFin.Type.1', 'BsmtFin.SF.1',
       'BsmtFin.Type.2', 'BsmtFin.SF.2', 'Bsmt.Unf.SF', 'Total.Bsmt.SF',
       'Heating', 'Heating.QC', 'Central.Air', 'Electrical', 'X1st.Flr.SF',
       'X2nd.Flr.SF', 'Low.Qual.Fin.SF', 'Gr.Liv.Area', 'Bsmt.Full.Bath',
       'Bsmt.Half.Bath', 'Full.Bath', 'Half.Bath', 'Bedroom.AbvGr',
       'Kitchen.AbvGr', 'Kitchen.Qual', 'TotRms.AbvGrd', 'Functional',
       'Fireplaces', 'Fireplace.Qu', 'Garage.Type', 'Garage.Yr.Blt',
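
Scanning a long column list by eye is error-prone, so a case-insensitive name search can help locate the price column. The sketch below is only an illustration: `find_columns` is a hypothetical helper, and the list is a short hand-copied subset of the column names shown in the outputs above. With the DataFrame loaded, you would pass `estate_data.columns` instead.

```python
def find_columns(columns, keyword):
    """Return the column names that contain `keyword`, ignoring case."""
    return [name for name in columns if keyword.lower() in name.lower()]

# Assumption: a small subset of the Ames column names, copied by hand.
columns = ['Order', 'PID', 'Lot.Area', 'Misc.Val', 'Yr.Sold',
           'Sale.Type', 'Sale.Condition', 'SalePrice']

price_columns = find_columns(columns, 'price')
print(price_columns)
```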
    

> Get the population data and find the population mean

In [45]:
# Pull the sale prices straight into a NumPy array
population_sales = estate_data['SalePrice'].to_numpy()

In [46]:
population_sales_mean = np.mean(population_sales).round(4)
population_sales_mean

180796.0601

> Get sample data from population and find sample mean

In [40]:
sample_sales = np.random.choice(a=population_sales, size=50, replace=True)
sample_sales

array([163000, 120000,  84900, 142000, 178750, 203000, 242000, 127000,
       162900,  98000, 129000, 153600, 207500, 345000, 137500, 131900,
       246000, 217000, 156000, 183500, 227875, 200000, 172500, 130000,
       392000, 129800, 157000, 134900, 128000, 212000, 160500, 129500,
       139000, 183600, 136500, 147000, 140000, 130250, 174000, 270000,
       130500, 280000,  81500, 160000, 113722, 157000, 163500, 160000,
       194000, 203160])

In [48]:
sample_sales_mean = np.mean(sample_sales).round(4)
sample_sales_mean

171327.14
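
The draw above changes on every run, which makes the reported numbers hard to reproduce. A minimal sketch of a repeatable version follows, assuming a seeded NumPy `Generator` and sampling without replacement so that no home is picked twice. The population here is a synthetic, right-skewed stand-in (an assumption, not the Ames data) so the block runs without downloading the CSV, and the seed value is arbitrary.

```python
import numpy as np

# Assumption: fixed seed chosen arbitrarily so the draw is repeatable.
rng = np.random.default_rng(seed=42)

# Assumption: synthetic right-skewed stand-in for the sale prices.
population = rng.lognormal(mean=12.0, sigma=0.4, size=3000)

# Draw 50 distinct observations (no value is selected twice).
sample = rng.choice(population, size=50, replace=False)

# The sample mean serves as the point estimate of the population mean.
point_estimate = sample.mean()
print(round(point_estimate, 2))
```

With the real data, `population` would be replaced by `population_sales`.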

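> Best point estimate

In [49]:
# The sample mean is the best point estimate of the population mean
best_pe = sample_sales_mean
best_pe

171327.14

> The best point estimate of the population mean price is the sample mean, 171327.14. It differs from the true population mean (180796.0601) by 9468.9201; that difference is the sampling error of this particular sample, not the estimate itself.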
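
The remaining tasks ask for sampling distributions built from 5000 samples of size 50 and of size 150. One way this could be sketched is shown below. `simulate_sample_means` is a hypothetical helper, the seed is arbitrary, and the population is a synthetic right-skewed stand-in (an assumption, not the Ames data) so the block is self-contained. The vector names `sample_means50` and `sample_means150` follow the task description.

```python
import numpy as np

def simulate_sample_means(population, n, reps, rng):
    """Draw `reps` random samples of size `n` and return their means."""
    return np.array([
        rng.choice(population, size=n, replace=False).mean()
        for _ in range(reps)
    ])

rng = np.random.default_rng(seed=0)  # assumption: arbitrary fixed seed

# Assumption: synthetic right-skewed stand-in for the sale prices.
population = rng.lognormal(mean=12.0, sigma=0.4, size=3000)

sample_means50 = simulate_sample_means(population, n=50, reps=5000, rng=rng)
sample_means150 = simulate_sample_means(population, n=150, reps=5000, rng=rng)

# Both distributions should center near the population mean;
# the one built from larger samples should be narrower.
print("population mean:        ", population.mean())
print("mean of sample_means50: ", sample_means50.mean())
print("mean of sample_means150:", sample_means150.mean())
print("spread (std), n=50:     ", sample_means50.std())
print("spread (std), n=150:    ", sample_means150.std())
```

With the real data, pass `population_sales` in place of `population`, and plot each vector with, for example, `plt.hist(sample_means50, bins=50)`. Because the standard deviation of a sample mean is roughly the population standard deviation divided by the square root of n, tripling the sample size should shrink the spread by a factor of about √3.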