## My project on "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron
### End-to-End Machine Learning Project🧊🧊🧊🧊🧊

In [None]:
import os
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
def load_housing_data(housing_path='./datasets/housing/'):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
housing_df = load_housing_data()

In [None]:
housing_df

In [None]:
housing_df.hist(bins=50, figsize=(20, 15))
plt.show()

###### This md note is just for you to see how `np.random.permutation()` works:
*** 
It takes an integer `n` and returns the integers from 0 up to (but not including) `n`, arranged in random order.
***
```
randstuff = np.random.permutation(10)
print(randstuff)
# Example output (varies between runs): [9 4 1 7 8 5 0 6 2 3]
```
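
A quick way to check this behaviour, sketched here with NumPy's `default_rng` generator (an assumption; the rest of this notebook uses the legacy `np.random.permutation`). The seed value 42 is arbitrary.

```python
import numpy as np

# Seeded generators: the same seed gives the same permutation.
first = np.random.default_rng(42).permutation(10)
second = np.random.default_rng(42).permutation(10)

print(first)
print(sorted(first.tolist()))         # always the integers 0..9
print(np.array_equal(first, second))  # True: seeding makes the shuffle repeatable
```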

- Let's separate our training set from our test set!!

In [None]:
def split_train_test(data, test_ratio):
    '''
        Usage: train_set, test_set = split_train_test(data, test_ratio)
        data: A pandas dataframe....
        test_ratio: Should be in the range [0, 1]
    '''
    # Shuffle the row positions, then cut them into a test part and a train part
    shuffled_indexes = np.random.permutation(len(data))
    test_size = int(len(data) * test_ratio)
    test_set_indexes = shuffled_indexes[:test_size]
    train_set_indexes = shuffled_indexes[test_size:]
    return [data.iloc[train_set_indexes], data.iloc[test_set_indexes]]

```
Well, the function above works, but there is still a problem.
If we run the program again, we will get a different test set, and over time your ML algorithm will see the whole dataset, which is what you and I both know we want to avoid 😎😎😎.
One solution is to save the test set on the first run and load it on subsequent runs. Another option is to set the random number generator's seed (e.g. np.random.seed(42)) before calling np.random.permutation(), so that it always generates the same shuffled indexes.
But the issue is that both solutions above will break if we fetch an updated dataset!!

A common solution is to use each instance's identifier to decide whether or not it should go
in the test set (assuming instances have a unique and immutable identifier). For
example, you could compute a hash of each instance's identifier and put that instance
in the test set if the hash is lower or equal to 20% of the maximum hash value. This
ensures that the test set will remain consistent across multiple runs, even if you
refresh the dataset. The new test set will contain 20% of the new instances, but it will
not contain any instance that was previously in the training set. Here is how it is implemented
```

In [None]:
from zlib import crc32

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column_name):
    ids = data[id_column_name]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return [data.loc[~in_test_set], data.loc[in_test_set]]
    

In [None]:
housing_with_id = housing_df.reset_index() # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
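
To see why the hash-based split survives a dataset refresh, here is a minimal sketch on toy ids. The helper name `is_in_test_set` and the two id ranges are made up for illustration; the hashing rule is the same `crc32` comparison used above.

```python
from zlib import crc32

import numpy as np
import pandas as pd


def is_in_test_set(identifier, test_ratio):
    # Hash the id and compare it against a fixed fraction of the 32-bit range.
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32


# Toy stand-ins for an "old" dataset and a refreshed one with 500 extra rows.
old_ids = pd.Series(range(1000))
new_ids = pd.Series(range(1500))

old_test = set(old_ids[old_ids.apply(lambda i: is_in_test_set(i, 0.2))])
new_test = set(new_ids[new_ids.apply(lambda i: is_in_test_set(i, 0.2))])

# Every id below 1000 gets the same assignment in both runs.
print(old_test == {i for i in new_test if i < 1000})  # True
print(len(new_test) / len(new_ids))                   # roughly 0.2
```

Because the decision depends only on the id, appending rows never moves an existing row from one set to the other.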

```
If you use the row index as a unique identifier, you need to make sure that new data
gets appended to the end of the dataset, and no row ever gets deleted. If this is not
possible, then you can try to use the most stable features to build a unique identifier.
For example, a district’s latitude and longitude are guaranteed to be stable for a few
million years, so you could combine them into an ID like so:
```

***
`housing_with_id["id"] = housing_df["longitude"] * 1000 + housing_df["latitude"]`
`train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")`
***

```
Scikit-Learn provides a few functions to split datasets into multiple subsets in various
ways. The simplest function is train_test_split, which does pretty much the same
thing as the function split_train_test defined earlier, with a couple of additional
features. First there is a random_state parameter that allows you to set the random
generator seed as explained previously, and second you can pass it multiple datasets
with an identical number of rows, and it will split them on the same indices (this is
very useful, for example, if you have a separate DataFrame for labels):
```

In [None]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(housing_df, test_size=0.2, random_state=42)
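
The "multiple datasets" feature mentioned above can be sketched like this, assuming a toy feature frame `X` and a toy label series `y` (both made up here):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and labels with matching row order (made-up values).
X = pd.DataFrame({"feature": np.arange(10)})
y = pd.Series(np.arange(10) * 10, name="label")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Both objects were split on the same row indices.
print(list(X_test.index) == list(y_test.index))  # True
print(len(X_train), len(X_test))                 # 8 2
```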

***
```
Suppose you chatted with experts who told you that the median income is a very
important attribute to predict median housing prices. You may want to ensure that
the test set is representative of the various categories of incomes in the whole dataset.
Since the median income is a continuous numerical attribute, you first need to create
an income category attribute. Let's look at the median income histogram more closely
(The histogram at the beginning of the notebook): most median income values are clustered around 1.5 to 6 (i.e.,
$15,000 - $60,000), but some median incomes go far beyond 6. It is important to have
a sufficient number of instances in your dataset for each stratum, or else the estimate
of the stratum's importance may be biased. This means that you should not have too
many strata, and each stratum should be large enough. The following code uses the
pd.cut() function to create an income category attribute with 5 categories (labeled
from 1 to 5): category 1 ranges from 0 to 1.5 (i.e., less than $15,000), category 2 from
1.5 to 3, and so on:
```
***

In [None]:
housing_df["income_categories"] = pd.cut(
    housing_df["median_income"],
    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
    labels=[1, 2, 3, 4, 5]
)
housing_df["income_categories"].hist()
plt.tight_layout()
plt.show()
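
To check how `pd.cut()` assigns values that sit exactly on a bin edge, here is a small sketch with made-up incomes and the same bins as above:

```python
import numpy as np
import pandas as pd

# A few made-up median incomes (in tens of thousands of dollars).
incomes = pd.Series([1.0, 1.5, 1.6, 3.0, 7.2])

categories = pd.cut(
    incomes,
    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
    labels=[1, 2, 3, 4, 5],
)
# Bins are right-inclusive by default, so 1.5 lands in category 1 and 3.0 in category 2.
print(categories.tolist())  # [1, 1, 2, 2, 5]
```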

- Notice that a new column has been added to our dataframe 🧊🧊🧊🧊

In [None]:
housing_df

***
Now you are ready to do `stratified sampling` based on the income category. For this you can use Scikit-Learn's `StratifiedShuffleSplit` class:
***
- Stratified sampling means sampling from every stratum (category) in proportion to its size, so that the test set is representative of the whole dataset

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

- Let's see what is inside `split.split(housing_df, housing_df["income_categories"])`

In [None]:
for x, y in split.split(housing_df, housing_df["income_categories"]):
    print(x)
    print(y)
    print(len(x))
    print(len(y))

In [None]:
for train_index, test_index in split.split(housing_df, housing_df["income_categories"]):
    strat_train_set = housing_df.loc[train_index]
    strat_test_set = housing_df.loc[test_index]

***
```
Let’s see if this worked as expected. You can start by looking at the income category proportions in both the test and train sets.

Notice that the proportions of each category in the two sets are very similar (if not the same).
```
***

In [None]:
strat_test_set["income_categories"].value_counts() / len(strat_test_set)

In [None]:
strat_train_set["income_categories"].value_counts() / len(strat_train_set)
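
It can be easier to compare the proportions side by side. The sketch below does that on a synthetic category column (the counts are made up), so it runs without the housing data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic category column with uneven strata (made-up counts, 1000 rows).
toy = pd.DataFrame({"cat": np.repeat([1, 2, 3, 4, 5], [40, 320, 350, 180, 110])})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, toy["cat"]))

# One row per category, one column per set.
comparison = pd.DataFrame({
    "overall": toy["cat"].value_counts(normalize=True),
    "train": toy["cat"].iloc[train_idx].value_counts(normalize=True),
    "test": toy["cat"].iloc[test_idx].value_counts(normalize=True),
}).sort_index()
print(comparison)
```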

- Now that we have built our train and test dataframes using stratified sampling on the `median_income` column, we can safely drop the `income_categories` column we created earlier!!!

In [None]:
for dataframe in [strat_test_set, strat_train_set]:
    dataframe.drop(["income_categories"], axis=1, inplace=True)

***
```
We spent quite a bit of time on test set generation for a good reason: this is an often
neglected but critical part of a Machine Learning project. Moreover, many of these
ideas will be useful later when we discuss cross-validation. Now it’s time to move on
to the next stage: exploring the data.

```
***

## Discover and Visualize the Data to Gain Insights

In [None]:
new_housing_df = strat_train_set.copy()
new_housing_df

In [None]:
new_housing_df.plot(kind="scatter", x="longitude", y="latitude")
plt.show()

In [None]:
new_housing_df.plot(kind="scatter", x="longitude", y="latitude", alpha=0.2)
plt.show()

In [None]:
new_housing_df.plot(
    kind="scatter", x="longitude", y="latitude", alpha=0.4, s=new_housing_df["population"]/100,
    label="population", figsize=(10, 7), c="median_house_value", cmap=plt.get_cmap("jet"), 
    colorbar=True
)

plt.show()

## Looking for Correlations
```
Since the dataset is not too large you can easily compute the standard correlation
coefficient (also called Pearson's r) between every pair of attributes using the corr()
method:
```

In [None]:
correlation = new_housing_df.corr(numeric_only=True)
correlation["median_house_value"].sort_values(ascending=False)
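
For reference, here is what `corr()` computes for one pair of columns, sketched on made-up values: Pearson's r is the covariance of the two columns divided by the product of their standard deviations.

```python
import numpy as np
import pandas as pd

# Made-up values for two attributes.
df = pd.DataFrame({
    "median_income": [1.5, 2.0, 3.5, 4.0, 6.0],
    "median_house_value": [90_000, 120_000, 200_000, 210_000, 350_000],
})

# Center both columns on their means.
x = df["median_income"] - df["median_income"].mean()
y = df["median_house_value"] - df["median_house_value"].mean()

# Pearson's r computed by hand, then the same value taken from pandas.
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
r_pandas = df.corr().loc["median_income", "median_house_value"]

print(r_manual, r_pandas)
```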

### Experimenting with Attribute Combinations
```
Hopefully the previous sections gave you an idea of a few ways you can explore the
data and gain insights. You identified a few data quirks that you may want to clean up
before feeding the data to a Machine Learning algorithm, and you found interesting
correlations between attributes, in particular with the target attribute. You also
noticed that some attributes have a tail-heavy distribution, so you may want to transform
them (e.g., by computing their logarithm). Of course, your mileage will vary
considerably with each project, but the general ideas are similar.
One last thing you may want to do before actually preparing the data for Machine
Learning algorithms is to try out various attribute combinations. For example, the
total number of rooms in a district is not very useful if you don’t know how many
households there are. What you really want is the number of rooms per household.
Similarly, the total number of bedrooms by itself is not very useful: you probably
want to compare it to the number of rooms. And the population per household also
seems like an interesting attribute combination to look at. Let’s create these new
attributes:
```

In [None]:
new_housing_df["rooms_per_household"] = new_housing_df["total_rooms"]/new_housing_df["households"]
new_housing_df["bedrooms_per_room"] = new_housing_df["total_bedrooms"]/new_housing_df["total_rooms"]
new_housing_df["population_per_household"]=new_housing_df["population"]/new_housing_df["households"]

In [None]:
correlation = new_housing_df.corr(numeric_only=True)
correlation["median_house_value"].sort_values(ascending=False)

***
```
We can see that the columns we engineered are more strongly correlated with median_house_value than the columns we engineered them from!! This means that we can add our engineered columns to our dataframe if they have a reasonable correlation with our label, which is median_house_value in this case.
```
***

## Prepare the Data for Machine Learning Algorithms
It's time to prepare the data for your Machine Learning algorithms. Instead of just
doing this manually, you should write functions to do that, for several good reasons:
- This will allow you to reproduce these transformations easily on any dataset (e.g.,
the next time you get a fresh dataset).
- You will gradually build a library of transformation functions that you can reuse
in future projects.
- You can use these functions in your live system to transform the new data before
feeding it to your algorithms.
- This will make it possible for you to easily try various transformations and see which combination of transformations works best.
```
But first let’s revert to a clean training set (by copying strat_train_set once again),
and let’s separate the predictors and the labels since we don’t necessarily want to apply
the same transformations to the predictors and the target values (note that drop()
creates a copy of the data and does not affect strat_train_set):
```

In [None]:
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

## Data Cleaning
Many datasets have some missing values, and if these are not handled well they can negatively affect our model.

#### There are three common ways to solve this problem:
1. Get rid of every row that has a missing value in that column. We do this using df.dropna(subset=[missing_column])
2. Get rid of a whole column. We do this by df.drop(column, axis=1)
3. Set the values to some value (zero, the mean, the median, etc.). We do this using df[column].fillna()
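
The three options can be sketched on a toy frame (the numbers are made up; the column names mirror the housing data):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in total_bedrooms.
toy = pd.DataFrame({
    "total_rooms": [100.0, 200.0, 300.0],
    "total_bedrooms": [20.0, np.nan, 60.0],
})

option1 = toy.dropna(subset=["total_bedrooms"])  # option 1: drop the affected rows
option2 = toy.drop("total_bedrooms", axis=1)     # option 2: drop the whole column
median = toy["total_bedrooms"].median()          # option 3: fill with the median
option3 = toy.assign(total_bedrooms=toy["total_bedrooms"].fillna(median))

print(option1.shape, option2.shape, option3.shape)  # (2, 2) (3, 1) (3, 2)
print(option3["total_bedrooms"].tolist())           # [20.0, 40.0, 60.0]
```

With option 3, remember to keep the median you computed on the training set, since the same value has to be used on the test set and on new data.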

Sklearn provides a handy class to take care of missing values: `SimpleImputer`.

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
# Since the median can only be computed on numerical attributes, we need to create a copy of the data without the text attribute ocean_proximity:

housing_num = housing.drop("ocean_proximity", axis=1)
imputer.fit_transform(housing_num)

- To access the median values that were used to fill the missing values, you can use the instance variable `statistics_`

In [None]:
imputer.statistics_

In [None]:
imputer.strategy

- Let's turn the text categories in the `ocean_proximity` column into numeric one-hot columns using `OneHotEncoder()`
- **Note:** `OneHotEncoder()` works like `pd.get_dummies()`

In [None]:
housing["ocean_proximity"]

In [None]:
from sklearn.preprocessing import OneHotEncoder
encoder_obj = OneHotEncoder()
ocean_prox_encoded = encoder_obj.fit_transform(housing[["ocean_proximity"]])
ocean_prox_encoded

***
```
Notice that the output is a SciPy sparse matrix, instead of a NumPy array. This is very
useful when you have categorical attributes with thousands of categories. After one-hot encoding we get a matrix with thousands of columns, and the matrix is full of
zeros except for a single 1 per row. Using up tons of memory mostly to store zeros
would be very wasteful, so instead a sparse matrix only stores the location of the nonzero elements. You can use it mostly like a normal 2D array, but if you really want to
convert it to a (dense) NumPy array, just call the toarray() method.
```
***

In [None]:
ocean_prox_encoded.toarray()
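
To back up the note that `OneHotEncoder()` works like `pd.get_dummies()`, here is a small comparison on a made-up sample of category values:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up sample of category values.
toy = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy[["ocean_proximity"]]).toarray()
dummies = pd.get_dummies(toy["ocean_proximity"]).to_numpy().astype(float)

print(encoder.categories_)         # the learned categories, one array per input column
print((encoded == dummies).all())  # True
```

One practical difference: the fitted encoder remembers its categories, so it can be reused later with `transform()` on new data.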

### Custom Transformers
Although Scikit-Learn provides many useful transformers, you will need to write
your own for tasks such as custom cleanup operations or combining specific
attributes. You will want your transformer to work seamlessly with Scikit-Learn functionalities
(such as pipelines), and since Scikit-Learn relies on duck typing (not inheritance),
all you need is to create a class and implement three methods: fit()
(returning self), transform(), and fit_transform(). You can get the last one for
free by simply adding TransformerMixin as a base class. Also, if you add BaseEstimator
as a base class (and avoid `*args` and `**kwargs` in your constructor) you will get
two extra methods (get_params() and set_params()) that will be useful for automatic hyperparameter tuning. For example, here is a small transformer class that adds
the combined attributes we discussed earlier:

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6
class CombinedAttributeAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X,  rooms_per_household, population_per_household]

### Feature scaling
Many machine learning algorithms don't perform well when the training set has features with very different scales.
To solve this problem, we have to rescale our features, and there are two common ways to get all our features onto the same scale:
1. Min-max scaling (also known as `normalization`)
2. standardization
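
Both approaches can be sketched on a single made-up feature using Scikit-Learn's `MinMaxScaler` and `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One made-up feature with a wide range.
X = np.array([[1.0], [5.0], [10.0], [100.0]])

min_max = MinMaxScaler().fit_transform(X)     # (x - min) / (max - min), values in [0, 1]
standard = StandardScaler().fit_transform(X)  # (x - mean) / std, zero mean and unit variance

print(min_max.ravel())
print(standard.ravel())
```

Min-max scaling bounds the values, so the large value 100 squeezes the others toward 0. Standardization has no fixed range but is less affected by such outliers.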

### Transformation Pipelines 

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
transform_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('feature_adder', CombinedAttributeAdder()),
    ('standard_scaler', StandardScaler()),
])
housing_num_tr = transform_pipeline.fit_transform(housing_num)
print(housing_num.shape)
housing_num_tr.shape

- Notice that we now have 11 columns after running the pipeline. This is because the `CombinedAttributeAdder()` in our pipeline added three more columns to our dataset.

***For real, pipelines are cool 🧊🧊🧊 !!!***

When you call the pipeline’s fit() method, it calls fit_transform() sequentially on
all transformers, passing the output of each call as the parameter to the next call, until
it reaches the final estimator, for which it just calls the fit() method.
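
The chaining described above can be checked by hand. This sketch uses a two-step pipeline on made-up data and compares it with calling the same transformers one after the other:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numeric data with one missing value.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
via_pipeline = pipeline.fit_transform(X)

# The same steps chained by hand: the output of one feeds the next.
by_hand = StandardScaler().fit_transform(
    SimpleImputer(strategy="median").fit_transform(X)
)

print(np.allclose(via_pipeline, by_hand))  # True
```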

In [None]:
list(housing_num)

So far, we have handled the categorical columns and the numerical columns separately.
It would be more convenient to have a single transformer able to handle all columns,
applying the appropriate transformations to each column. In version 0.20,
Scikit-Learn introduced the ColumnTransformer for this purpose, and the good news
is that it works great with Pandas DataFrames. Let’s use it to apply all the transformations
to the housing data:

In [None]:
from sklearn.compose import ColumnTransformer
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", transform_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs)
])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared

### Select and Train a Model
At last! You framed the problem, you got the data and explored it, you sampled a
training set and a test set, and you wrote transformation pipelines to clean up and
prepare your data for Machine Learning algorithms automatically. You are now ready
to select and train a Machine Learning model.

- #### Let’s first train a Linear Regression model

In [None]:
from sklearn.linear_model import LinearRegression
regresssor = LinearRegression()

regresssor.fit(housing_prepared, housing_labels)

- Now that we have trained our model with `fit()`, let's use it to predict.

In [None]:
some_test_data = housing.iloc[:5, :]
some_test_data_prepared = full_pipeline.transform(some_test_data)
some_label_data = housing_labels.iloc[:5]
print(some_label_data)
some_test_data

In [None]:
print(f"Predictions: {regresssor.predict(some_test_data_prepared)}")
print(f"Labels: {list(some_label_data)}")

In [None]:
strat_test_set
test_set_attrib = strat_test_set.drop(["median_house_value"], axis=1)
test_set_label = strat_test_set["median_house_value"]

In [None]:
test_set_attrib_prepared = full_pipeline.transform(test_set_attrib)
test_set_attrib_prepared

In [None]:
regresssor.score(test_set_attrib_prepared, test_set_label)
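
For regressors, `score()` returns the coefficient of determination R², not a classification-style accuracy. A minimal sketch on made-up data, computing R² by hand and comparing it with `score()`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, roughly linear data.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

model = LinearRegression().fit(X, y)
predictions = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = ((y - predictions) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X, y))
```

A value of 1.0 would mean perfect predictions, and 0.0 would mean the model does no better than always predicting the mean label.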

 - Yay!! We have trained our model and tested it, although the model's score is still poor.
 Let's measure the RMSE on the whole training set to see what is really going on.

In [None]:
from sklearn.metrics import mean_squared_error
# Evaluate on the whole training set
housing_predictions = regresssor.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

- With this, we can see that our current model (let's say model 1.0) is not a good `median_house_value` predictor: most districts' median house values range between `$120,000` and `$265,000`, so a typical prediction error as large as `lin_rmse` is not very satisfying
        
***This is an example of the model underfitting the training data***

#### As we have seen in the previous lessons, the main ways to fix underfitting are:
1. To select a more powerful model
2. To feed the training algorithm with better features
3. To reduce the constraints on the model

Note: This model is not regularized, so this rules
out the last option

- Let’s train a DecisionTreeRegressor. This is a powerful model, capable of finding complex nonlinear relationships in the data.

In [None]:
from sklearn.tree import DecisionTreeRegressor
tree_regressor = DecisionTreeRegressor()
tree_regressor.fit(housing_prepared, housing_labels)

In [None]:
tree_predictions = tree_regressor.predict(housing_prepared)
print(f"Predictions: {list(tree_predictions)[:10]}")
print(f"Labels: {list(housing_labels)[:10]}")

In [None]:
tree_mse = mean_squared_error(housing_labels, tree_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

```
Wait, what!? No error at all? Could this model really be absolutely perfect? Of course,
it is much more likely that the model has badly overfit the data. How can you be sure?
As we saw earlier, you don’t want to touch the test set until you are ready to launch a
model you are confident about, so you need to use part of the training set for training,
and part for model validation.
```

In [None]:
# Just a random plot hahahaha!!!!
plt.plot(housing_labels)

```
One way to evaluate the Decision Tree model would be to use the train_test_split
function to split the training set into a smaller training set and a validation set, then
train your models against the smaller training set and evaluate them against the validation
set. It’s a bit of work, but nothing too difficult and it would work fairly well.
A great alternative is to use Scikit-Learn’s K-fold cross-validation feature. The following
code randomly splits the training set into 10 distinct subsets called folds, then it
trains and evaluates the Decision Tree model 10 times, picking a different fold for
evaluation every time and training on the other 9 folds. The result is an array containing
the 10 evaluation scores:
```

In [None]:
from sklearn.model_selection import cross_val_score
tree_scores = cross_val_score(tree_regressor, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-tree_scores)
list(tree_rmse_scores)

In [None]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

In [None]:
display_scores(tree_rmse_scores)
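
To make the fold mechanics concrete, here is a sketch that runs the K-fold loop by hand and compares it with `cross_val_score`. It uses made-up data, 5 folds, and a `LinearRegression` model (chosen because it is deterministic), so the names and sizes are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Made-up regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

folds = KFold(n_splits=5)
manual_rmse = []
for train_idx, val_idx in folds.split(X):
    # Train on four folds, evaluate on the held-out fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    mse = mean_squared_error(y[val_idx], model.predict(X[val_idx]))
    manual_rmse.append(np.sqrt(mse))

scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=folds)
print(np.allclose(manual_rmse, np.sqrt(-scores)))  # True
```

The scoring function returns negative MSE values (higher is better), which is why the sign is flipped before taking the square root.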

In [None]:
import plotly.graph_objs as go
import plotly.offline as pyo

# Create random data with numpy
N = 100
random_x = np.linspace(0, 1, N)
random_y0 = np.random.randn(N) + 5
random_y1 = np.random.randn(N)
random_y2 = np.random.randn(N) - 5

# Create traces
trace0 = go.Scatter(x=random_x, y=random_y0, mode='lines', name='lines')
trace1 = go.Scatter(x=random_x, y=random_y1, mode='lines+markers', name='lines+markers')
trace2 = go.Scatter(x=random_x, y=random_y2, mode='markers', name='markers')

data = [trace0, trace1, trace2]
pyo.plot(data, filename='line-mode.html')


In [None]:
# complete the decision tree algorithm when you figure out how and why it is said to overfit

```
Random forest works by training many decision trees on random subsets of the features, then averaging out their predictions.
```
***Building a model on top of many other models is called ensemble Learning and it is often a great way to push ML algorithms even further.***
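
The averaging step can be verified directly. This sketch fits a small `RandomForestRegressor` on made-up data (10 trees and the seed are arbitrary choices) and averages the individual trees' predictions by hand:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] * 2.0 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Average the predictions of the individual trees by hand.
tree_average = np.mean([tree.predict(X) for tree in forest.estimators_], axis=0)

print(np.allclose(tree_average, forest.predict(X)))  # True
```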

In [None]:
from sklearn.ensemble import RandomForestRegressor
forest_regressor = RandomForestRegressor()
forest_regressor.fit(housing_prepared, housing_labels)

In [None]:
forest_predictions = forest_regressor.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, forest_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

In [None]:
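forest_scores = cross_val_score(forest_regressor, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
# The scores are negative MSE values, so flip the sign before taking the square root
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)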