**Getting Our Hands Dirty with Data**
In this section, we will work through an example real estate project. These are the main steps:
1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### 1. Look at the Big Picture


### Real Estate Median Price Prediction.

Given data about real estate in Boston, let's try to predict the median home value.

### Predict the crime per capita of this dataset.



#### Frame Your Problem

The problem is a supervised learning task.
It is a regression task.
Should we use batch learning or online learning techniques?

#### Select a Performance Measure
###### RMSE (Root Mean Square Error)
A typical performance measure for regression problems is the Root Mean Square Error (RMSE).
Another such measure is the Mean Absolute Error (MAE).
Both the RMSE and the MAE measure the distance between two vectors: the vector of predictions and the vector of target values.
RMSE is more sensitive to outliers than MAE. But when outliers are exponentially rare, the RMSE performs very well and is generally preferred.
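
To make the difference concrete, here is a minimal sketch in plain Python that computes both measures by hand. The target and prediction values are made up for illustration and do not come from the dataset; the helper names `rmse` and `mae` are also just for this example.

```python
import math

def rmse(targets, predictions):
    """Root Mean Square Error: square root of the mean squared difference."""
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mae(targets, predictions):
    """Mean Absolute Error: mean of the absolute differences."""
    absolute_errors = [abs(t - p) for t, p in zip(targets, predictions)]
    return sum(absolute_errors) / len(absolute_errors)

# Made-up target values and two sets of predictions (illustrative only).
targets = [20.0, 22.0, 24.0, 26.0]
small_errors = [21.0, 21.0, 25.0, 25.0]  # every prediction is off by 1
one_outlier = [20.0, 22.0, 24.0, 34.0]   # a single prediction is off by 8

print("small errors:", rmse(targets, small_errors), mae(targets, small_errors))
print("one outlier: ", rmse(targets, one_outlier), mae(targets, one_outlier))
```

When every error has the same size, the two measures agree. With a single large error, the RMSE grows twice as much as the MAE in this example, because squaring gives large errors more weight.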

### 2. Get the Data

We are required to:
1. Read the data from the CSV file and store it in a variable.
2. Print the first five rows of the loaded dataset (HINT: use the head() method).

In [None]:
data=pd.read_csv('/kaggle/input/real-estate-dataset/data.csv')

In [None]:
data.head()

### 3. Discover and visualize the data to gain insights

The info() method is useful to get a quick description of the data, in particular the total number of rows, and each attribute’s type and number of non-null values.

We are required to:
1. Use the info() method to find out the types of the data fields in our dataset and identify fields with null values.

In [None]:
data.info()

We have null values in the attribute RM ("average number of rooms per dwelling").


We are required to:
1. Using the describe() method, find a summary of the numerical attributes of our dataset.

In [None]:
tab_desc=data.describe()
tab_desc

In [None]:
tab_desc = tab_desc.iloc[1:]
tab_desc

Note: The describe() method ignores the null values in the RM column.

## One of the best plots for visualizing outliers is a Box and Whisker plot.

You are required to:
1. Draw a Box and Whisker plot for the numerical attributes of our dataset.
2. Draw a Scatter plot for longitude and latitude columns of our dataset.
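
A box and whisker plot usually marks a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the box; this is Matplotlib's default whisker setting. The sketch below applies that rule by hand in plain Python. The values are made up for illustration, and the quartiles are computed with the standard library, so they may differ slightly from what a plotting library computes on other data.

```python
import statistics

# Made-up values with one obvious outlier (illustrative only).
values = [10, 12, 13, 14, 15, 16, 17, 18, 50]

# Quartiles, using linear interpolation between data points.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("quartiles:", q1, median, q3)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```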

In [None]:
tab_desc=np.array(tab_desc)


In [None]:

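import matplotlib.pyplot as plot

# Draw the box plot from the raw observations rather than from the describe() summary,
# so that the boxes, whiskers and outliers reflect the actual data.
numeric_columns = data.select_dtypes(include="number").columns[:7]
data[numeric_columns].plot(kind="box", figsize=(12, 6))

plot.show()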

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
data.hist(bins=50,figsize=(20,15))
plt.show()

### 4. Prepare the data for Machine Learning algorithms

## Create a test set

In [None]:
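from zlib import crc32

def test_set_check(identifier, test_ratio):
    # An instance goes to the test set when the hash of its identifier
    # falls within the lowest test_ratio fraction of the 32-bit hash range.
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    # Return (train_set, test_set)
    return data.loc[~in_test_set], data.loc[in_test_set]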
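
The point of splitting by a hash of the identifier is that the assignment of a row never changes, even when new rows are added later. The sketch below illustrates this with plain integers and the standard library only. The helper `in_test_set` and the byte encoding of the identifier are assumptions made for this example, so the exact rows selected can differ from the NumPy-based version above.

```python
from zlib import crc32

def in_test_set(identifier, test_ratio):
    """Return True when the identifier's CRC32 hash falls in the lowest test_ratio share."""
    # Encode the integer as 8 bytes; the byte order is an arbitrary choice for this sketch.
    hashed = crc32(identifier.to_bytes(8, "little", signed=True)) & 0xFFFFFFFF
    return hashed < test_ratio * 2**32

# First version of the dataset: identifiers 0..999.
first_batch = list(range(1000))
test_ids_before = {i for i in first_batch if in_test_set(i, 0.2)}

# Later, 500 new rows arrive with new identifiers.
second_batch = list(range(1500))
test_ids_after = {i for i in second_batch if in_test_set(i, 0.2)}

# Every row that was in the test set stays there.
print(test_ids_before <= test_ids_after)
# Any row added to the test set is a new row, never an old training row.
print(all(i >= 1000 for i in test_ids_after - test_ids_before))
# Share of the first batch that landed in the test set (close to the requested ratio).
print(len(test_ids_before) / len(first_batch))
```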

In [None]:
data_with_id= data.reset_index() ## add an 'index' column
train_set, test_set = split_train_test_by_id(data_with_id,0.2,'index')

In [None]:
data_with_id['id']= data['CRIM']*1000 + data['ZN']
train_set,test_set = split_train_test_by_id(data_with_id,0.2,'id')

In [None]:
test_set.describe()

In [None]:
import pandas as pd
data['Tax_marge'] = pd.cut(data['TAX'],bins=[0.0,200.0 ,350.0 ,400.0  ,650.0 ,np.inf],labels=[1,2,3,4,5])

In [None]:
data['Tax_marge'].hist()

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index,test_index in split.split(data,data["Tax_marge"]):
    strat_train_set=data.loc[train_index]
    strat_test_set=data.loc[test_index]

In [None]:
strat_test_set["Tax_marge"].value_counts() / len(strat_test_set)

In [None]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("Tax_marge", axis=1, inplace=True)

## Exploring the data

<p style='color: green; font-weight:bold '>Discover and Visualize the Data to Gain Insights </p>

In [None]:
data = strat_train_set.copy()

In [None]:
dataTest=strat_test_set.copy()

In [None]:
data.info()

In [None]:
dataTest.info()

<p style='color: green; font-weight:bold '>Visualizing Geographical Data</p>

In [None]:
data.plot(kind="scatter",x="ZN",y="LSTAT")

<p style='color: green; font-weight:bold '>Looking for correlation </p>

In [None]:
corr_matrix= data.corr()

In [None]:
corr_matrix

In [None]:
corr_matrix['MEDV'].sort_values(ascending=False)
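
By default, `corr()` computes the Pearson correlation coefficient, which ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship). As a rough sketch of what that number means, here is the same coefficient computed by hand in plain Python. The values and the helper name `pearson_r` are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values (illustrative only).
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
price_up = [12.0, 15.0, 18.0, 21.0, 24.0]    # rises linearly with rooms
price_down = [30.0, 25.0, 20.0, 15.0, 10.0]  # falls linearly with rooms

print(pearson_r(rooms, price_up))
print(pearson_r(rooms, price_down))
```

Note that this coefficient only captures linear relationships; two attributes can be strongly related in a nonlinear way and still have a coefficient near zero.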

In [None]:
from pandas.plotting import scatter_matrix
attributs=['RM','PTRATIO','TAX','LSTAT']
scatter_matrix(data[attributs],figsize=(12,8))


In [None]:
data.plot(kind="scatter",x="RM",y="MEDV")

In [None]:
data.plot(kind="scatter",x="RM",y="MEDV",alpha=0.1)

<p style='color: green; font-weight:bold '>Experimenting with Attribute Combinations</p>

In [None]:
data['TAX_RM']= data['TAX']/data['RM']


In [None]:
corr_matrix=data.corr()
corr_matrix['MEDV'].sort_values(ascending=False)

## Revert to a clean training set

In [None]:
data=strat_train_set.drop("MEDV",axis=1)
data_labels= strat_train_set['MEDV'].copy()

In [None]:
data.info()

Data cleaning
<br>
We cannot work with missing features. To fix this, we can:
<br>
<ol>
<li>Get rid of the corresponding districts.</li>
<li>Get rid of the whole attribute.</li>
<li>Set the values to some value (zero, the mean, the median, etc.)</li>
</ol>

In [None]:
#data.dropna(subset=["RM"])

In [None]:
#data.drop("RM", axis=1)

In [None]:
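median = data["RM"].median()
# Keep the median value: we need it later to fill the missing values
# in the test set.
# Assign the result back instead of using inplace=True on a column selection,
# which may not modify the DataFrame in recent pandas versions.
data["RM"] = data["RM"].fillna(median)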
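
The reason for keeping the median is that the test set must be filled with the value learned from the training set, not with a value computed from the test set itself. A minimal sketch of that idea in plain Python follows; the numbers and the helper `fill_missing` are made up for illustration.

```python
import statistics

# Made-up RM-like values; None marks a missing entry (illustrative only).
train_values = [5.9, 6.2, None, 6.8, 7.1]
test_values = [6.0, None, 6.5]

# Learn the median from the training values only, ignoring missing entries.
train_median = statistics.median(v for v in train_values if v is not None)

def fill_missing(values, fill_value):
    """Return a copy of values with every None replaced by fill_value."""
    return [fill_value if v is None else v for v in values]

train_filled = fill_missing(train_values, train_median)
test_filled = fill_missing(test_values, train_median)  # reuse the training median

print(train_median)
print(train_filled)
print(test_filled)
```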

In [None]:
### Scikit-learn provides a handy class to handle missing values.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
imputer.fit(data) 
## it's safer to apply the imputer to all the numerical attributes.

In [None]:
## The imputer stores the learned medians in its statistics_ attribute.
imputer.statistics_

In [None]:
data.median().values

In [None]:
# Replace the missing values with the learned medians:

x=imputer.transform(data)

# The result is a plain NumPy array

In [None]:
data_transform=pd.DataFrame(x,columns=data.columns, index=data.index)
data_transform

In [None]:
data_transform.describe()

<p style='color: green; font-weight:bold '>Feature Scaling</p>

### Min-max scaling (normalization) and standardization.
### Transformation pipelines.
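
As a quick illustration of the two scaling methods, the sketch below applies both formulas by hand to a small made-up list of values. Min-max scaling maps the values into the range 0 to 1, while standardization subtracts the mean and divides by the standard deviation. The population standard deviation is used here, which is also what Scikit-Learn's `StandardScaler` uses.

```python
import statistics

# Made-up values (illustrative only).
values = [1.0, 2.0, 3.0, 4.0, 5.0]

# Min-max scaling: (x - min) / (max - min), so results lie in [0, 1].
lowest, highest = min(values), max(values)
min_max_scaled = [(v - lowest) / (highest - lowest) for v in values]

# Standardization: (x - mean) / std, so results have zero mean and unit variance.
mean = statistics.fmean(values)
std = statistics.pstdev(values)  # population standard deviation
standardized = [(v - mean) / std for v in values]

print(min_max_scaled)
print(standardized)
```

Unlike min-max scaling, standardization does not bound the values to a fixed range, but it is much less affected by outliers.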

In [None]:
## for numerical attributes
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline=Pipeline([
    ('imputer',SimpleImputer(strategy="median")),
    # Other transformers can be added here.
    ('std_scaler', StandardScaler())
])

In [None]:
data_prepared=num_pipeline.fit_transform(data)

<p style='color: green; font-weight:bold '>Select and Train a Model</p>

In [None]:
from sklearn.svm import SVR
svr_rbf = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=.1)
svr_rbf.fit(data_prepared, data_labels)

In [None]:
data=pd.DataFrame(data)
some_data = data.iloc[:5]
some_labels = data_labels.iloc[:5]
some_prepared_data = num_pipeline.transform(some_data)
svr_rbf.predict(some_prepared_data)

In [None]:
print("Labels: ",list(some_labels))

In [None]:
## Let's measure this regression model's RMSE on the whole training set using Scikit-Learn's mean_squared_error() function

from sklearn.metrics import mean_squared_error

data_predictions=svr_rbf.predict(data_prepared)
svr_mse=mean_squared_error(data_labels,data_predictions)
svr_rmse=np.sqrt(svr_mse)
svr_rmse

<p style='color: green; font-weight:bold '>Better Evaluation Using Cross-validation</p>

In [None]:
from sklearn.model_selection import cross_val_score
scores= cross_val_score(svr_rbf,data_prepared,data_labels,scoring = 'neg_mean_squared_error',cv=10)
## cv=10 splits the training set into 10 distinct subsets called folds.
svr_rmse_scores=np.sqrt(-scores)
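
With `cv=10`, the model is trained and evaluated 10 times, each time holding out a different fold for validation and training on the other nine. The scoring function returns the negative MSE because Scikit-Learn expects a score where greater is better, which is why the sign is flipped before taking the square root. The sketch below shows how such folds can be laid out, using a hypothetical helper `k_fold_indices` written for this example with contiguous, unshuffled folds.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    # Spread the remainder so that the first folds get one extra sample.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

# Small example: 10 samples split into 5 folds.
folds = list(k_fold_indices(n_samples=10, n_folds=5))
for train, validation in folds:
    print("validate on", validation, "train on", train)
```

Each sample is used for validation exactly once, so the resulting scores give both an estimate of the model's performance and a measure of how precise that estimate is (the standard deviation).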

In [None]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:",scores.std())
display_scores(svr_rmse_scores)

In [None]:
### We can easily save Scikit-Learn models with Python's pickle module or with joblib, which is more efficient at serializing large NumPy arrays.

import joblib
joblib.dump(svr_rbf,"my_model.pkl")
## and later
#my_model_loaded=joblib.load("my_model.pkl")

<p style='color: green; font-weight:bold '>Fine-Tune My Model</p>

## A few ways to do fine-tuning

<ul>
    <li>Grid Search</li>
    <li>Randomized Search</li>
    <li>Ensemble Methods</li>
</ul>
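
The idea behind grid search is simply to try every combination of the listed hyperparameter values and keep the one with the best validation score. The sketch below shows that loop in plain Python. The function `validation_error` is a hypothetical stand-in for a real cross-validated error, and the grid values are made up. In practice, Scikit-Learn's `GridSearchCV` performs this search with real cross-validation.

```python
import itertools

def validation_error(C, gamma):
    """Hypothetical stand-in for a cross-validated error; lowest at C=10, gamma=0.1."""
    return (C - 10) ** 2 + (100 * (gamma - 0.1)) ** 2

# Made-up grid of hyperparameter values to try (illustrative only).
param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}

# Build every combination of the listed values.
names = list(param_grid)
candidates = [dict(zip(names, combo))
              for combo in itertools.product(*param_grid.values())]

# Keep the combination with the lowest error.
best_params = min(candidates, key=lambda params: validation_error(**params))

print(len(candidates), "combinations tried")
print("best:", best_params)
```

Randomized search follows the same pattern but samples a fixed number of combinations at random instead of trying them all, which scales better when the search space is large.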


In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
distributions = dict(C=uniform(loc=0, scale=4))
clf = RandomizedSearchCV(svr_rbf, distributions, random_state=0)
search = clf.fit(data_prepared, data_labels)
search.best_params_

<p style='color: green; font-weight:bold '> Evaluate Your System On the Test Set </p>

In [None]:
final_model = search.best_estimator_
x_test=strat_test_set.drop("MEDV", axis= 1)
y_test=strat_test_set["MEDV"].copy()
X_test_prepared=num_pipeline.transform(x_test)
final_predictions=final_model.predict(X_test_prepared)
final_mse=mean_squared_error(y_test,final_predictions)
final_rmse=np.sqrt(final_mse)


In [None]:
final_rmse