### Data Scaling and Working with Dataframes
When training models on data, it is often useful to scale the input features so that they all have a similar range of values. There are multiple ways to apply this scaling, and you will build functions to perform some of them. You will then apply these functions to some data.

1. Data Normalization
A common method to scale a data set feature is to normalize it. This puts all of the feature's values
on a scale between 0 and 1. Given a set of values $X_1, X_2, \ldots, X_n$, each corresponding normalized value
is calculated using the following formula.

$$\text{normalized } X_i = \frac{X_i - \min\{X_1, X_2, \ldots, X_n\}}{\max\{X_1, X_2, \ldots, X_n\} - \min\{X_1, X_2, \ldots, X_n\}}$$

   Build a Python function that takes in a vector (array) and normalizes it.

In [27]:
import numpy as np
# Min-max normalization: rescale the values to the range [0, 1]
def normalize(arr):
    return (arr - np.min(arr)) / (np.max(arr) - np.min(arr))

norm = [1, 2, 4, 8, 10, 15]
normalize(norm)


array([0.        , 0.07142857, 0.21428571, 0.5       , 0.64285714,
       1.        ])
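
One edge case the formula does not cover is a constant vector, where the maximum equals the minimum and the denominator is zero. The following is a minimal sketch of one way to guard against it. The helper name `normalize_safe` and the convention of mapping a constant vector to all zeros are assumptions made for illustration, not part of the exercise.

```python
import numpy as np

def normalize_safe(arr):
    """Min-max scale a 1-D sequence to [0, 1] (hypothetical helper)."""
    arr = np.asarray(arr, dtype=float)   # accept lists as well as arrays
    lo, hi = arr.min(), arr.max()
    if hi == lo:                         # constant input: avoid dividing by zero
        return np.zeros_like(arr)        # assumed convention: map every value to 0
    return (arr - lo) / (hi - lo)

print(normalize_safe([1, 2, 4, 8, 10, 15]))
print(normalize_safe([3, 3, 3]))
```

Whether zeros are the right result for a constant feature depends on the model; dropping such a feature is another reasonable choice.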

2. Data Standardization
Another common method to scale a data set feature is to standardize it. This calculates the z-score
of each feature value. Given a set of values $X_1, X_2, \ldots, X_n$ with sample mean $\bar{X}$ and sample standard
deviation $s_X$, each corresponding standardized value is calculated using the following formula.

$$\text{standardized } X_i = \frac{X_i - \bar{X}}{s_X}$$

Build a Python function that takes in a vector (array) and standardizes it.

In [84]:
# Standardization: convert each value to its z-score
def standardized(arr):
    # Note: np.std defaults to ddof=0 (population standard deviation)
    return (arr - np.mean(arr)) / np.std(arr)

arr = [1, 2, 3, 4, 5, 6]
standardized(arr)

array([-1.46385011, -0.87831007, -0.29277002,  0.29277002,  0.87831007,
        1.46385011])
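
The output above divides by the population standard deviation, because `np.std` uses `ddof=0` unless told otherwise. To follow the definition in terms of the sample standard deviation $s_X$, `ddof=1` can be passed instead. A minimal sketch, where the function name `standardize_sample` is hypothetical:

```python
import numpy as np

def standardize_sample(arr):
    """z-scores computed with the sample standard deviation (ddof=1)."""
    arr = np.asarray(arr, dtype=float)        # accept lists as well as arrays
    return (arr - arr.mean()) / arr.std(ddof=1)

z = standardize_sample([1, 2, 3, 4, 5, 6])
print(z)
```

For these six values the first z-score is about -1.336 instead of -1.464. The gap between the two versions shrinks as the number of values grows.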

3. Working with a Dataframe

In this problem, you will be working with the data set calif_housing_data.csv. This data is a modified
version of the data set from https://www.kaggle.com/camnugent/california-housing-prices?select=housing.csv.
This data set has housing information on various California block neighborhoods. There are five
columns in the data, which include the median age (in years) of houses on the block, the total number
of bedrooms on the block, the total number of households on the block, the median income on the
block, and the median house value on the block. Now, suppose you are building a model to predict
the median house value. Use Python code to answer the following questions.

In [48]:
import pandas as pd
# import file
df = pd.read_csv(r"C:\Users\bharo\OneDrive\Documents\calif_housing_data.csv")
df.head()

Unnamed: 0,housing_median_age,total_bedrooms,households,median_income,median_house_value
0,41,129.0,126,8.3252,452600.0
1,21,1106.0,1138,8.3014,358500.0
2,52,190.0,177,7.2574,352100.0
3,52,235.0,219,5.6431,341300.0
4,52,280.0,259,3.8462,342200.0


(a) How many rows does this data set have?

In [56]:
df.shape[0]

20640

(b) What is the target vector for your model?

The target vector for this model is the median house value.
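
In code, the target can be separated from the remaining columns as sketched below. The small frame is a stand-in built from the first rows displayed by `df.head()` above, and the names `sample`, `X`, and `y` are assumptions made for illustration.

```python
import pandas as pd

# Stand-in for df: the first three rows shown by df.head() above.
sample = pd.DataFrame({
    "housing_median_age": [41, 21, 52],
    "total_bedrooms": [129.0, 1106.0, 190.0],
    "households": [126, 1138, 177],
    "median_income": [8.3252, 8.3014, 7.2574],
    "median_house_value": [452600.0, 358500.0, 352100.0],
})

y = sample["median_house_value"]                # target vector
X = sample.drop(columns="median_house_value")   # remaining columns are candidate features

print(X.shape, y.shape)
```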

(c) Create a new feature by taking the total bedrooms divided by the number of households.  What does this new feature represent?

In [68]:
# Create the new column: bedrooms per household on each block
df['mean_bedroom'] = df['total_bedrooms'] / df['households']
df.head()
# This feature represents the mean number of bedrooms per household on the block.

Unnamed: 0,housing_median_age,total_bedrooms,households,median_income,median_house_value,mean_bedroom
0,41,129.0,126,8.3252,452600.0,1.02381
1,21,1106.0,1138,8.3014,358500.0,0.97188
2,52,190.0,177,7.2574,352100.0,1.073446
3,52,235.0,219,5.6431,341300.0,1.073059
4,52,280.0,259,3.8462,342200.0,1.081081
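
A ratio feature like this can pick up `NaN` or infinite values if a bedroom count is missing or a block has zero households. Whether that happens in this particular file has not been checked here, so the sketch below uses a hypothetical toy frame to show how such rows could be counted.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one missing bedroom count and one block with zero households.
toy = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 50.0],
    "households": [126, 1138, 177, 0],
})
toy["mean_bedroom"] = toy["total_bedrooms"] / toy["households"]

n_missing = int(toy["mean_bedroom"].isna().sum())       # NaN caused by a missing input
n_infinite = int(np.isinf(toy["mean_bedroom"]).sum())   # inf caused by division by zero
print(n_missing, n_infinite)
```

Running the same two counts on `df['mean_bedroom']` would show whether any rows need attention before scaling.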


(d) Now, create a new data frame that has three features: the median age, median income, and the
new feature created in part (c).

In [77]:
new_df = df[['housing_median_age','median_income','mean_bedroom']]
new_df.head()


Unnamed: 0,housing_median_age,median_income,mean_bedroom
0,41,8.3252,1.02381
1,21,8.3014,0.97188
2,52,7.2574,1.073446
3,52,5.6431,1.073059
4,52,3.8462,1.081081


(e) Take the data frame created in part (d) and apply data standardization to the features.

In [85]:
# Apply normalize() to each column; apply() passes one column (Series) at a time
new_df.apply(normalize)


Unnamed: 0,housing_median_age,median_income,mean_bedroom
0,0.784314,0.539668,0.020469
1,0.392157,0.538027,0.018929
2,1.000000,0.466028,0.021940
3,1.000000,0.354699,0.021929
4,1.000000,0.230776,0.022166
...,...,...,...
20635,0.470588,0.073130,0.023715
20636,0.333333,0.141853,0.029124
20637,0.313725,0.082764,0.023323
20638,0.333333,0.094295,0.024859


In [86]:
# Apply standardized() to each column; apply() passes one column (Series) at a time
new_df.apply(standardized)


Unnamed: 0,housing_median_age,median_income,mean_bedroom
0,0.982143,2.344766,-0.153863
1,-0.607019,2.332238,-0.262936
2,1.856182,1.782699,-0.049604
3,1.856182,0.932968,-0.050417
4,1.856182,-0.012881,-0.033568
...,...,...,...
20635,-0.289187,-1.216128,0.076185
20636,-0.845393,-0.691593,0.459421
20637,-0.924851,-1.142593,0.048373
20638,-0.845393,-1.054583,0.157233
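
Passing a whole DataFrame to NumPy reductions such as `np.mean` relies on pandas reducing column by column, and that behaviour may depend on the pandas version. The sketch below makes the column-wise intent explicit by using DataFrame methods with `axis=0`. The helper names `standardize_columns` and `normalize_columns` and the toy data are assumptions made for illustration.

```python
import pandas as pd

def standardize_columns(frame):
    """Column-wise z-scores; ddof=0 matches np.std's default."""
    return (frame - frame.mean(axis=0)) / frame.std(axis=0, ddof=0)

def normalize_columns(frame):
    """Column-wise min-max scaling to [0, 1]."""
    col_min = frame.min(axis=0)
    return (frame - col_min) / (frame.max(axis=0) - col_min)

# Toy data standing in for new_df
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})
z = standardize_columns(toy)
m = normalize_columns(toy)
print(z)
print(m)
```

After scaling, each column of `z` has mean 0 and population standard deviation 1, and each column of `m` runs from 0 to 1, regardless of the original units.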

