**Chapter 2 – End-to-end Machine Learning project**

*Welcome to Machine Learning Housing Corp.! Your task is to predict median house values in Californian districts, given a number of features from these districts.*

*This notebook contains all the sample code and solutions to the exercises in chapter 2.*

**Note**: You may find small differences between the code outputs in the book and in these Jupyter notebooks. These differences are mostly due to the random nature of many training algorithms: although I have tried to make these notebooks' outputs as constant as possible, it is impossible to guarantee that they will produce exactly the same output on every platform. Also, some data structures (such as dictionaries in Python versions before 3.7) do not preserve item order. Finally, I fixed a few minor bugs (I added notes next to the affected cells), which leads to slightly different results without changing the ideas presented in the book.

# Setup

First, let's make sure this notebook works well in both Python 2 and 3, import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures:

In [132]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports 
import numpy as np

#print(np.__version__) #print numpy version, e.g., 1.21.5
#to upgrade: pip install --upgrade numpy

import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "C:\\Users\\user\\Desktop\\AN 3\\SEM 2\\Intelligent systems\\VaidaDiana2"
CHAPTER_ID = "VaidaDianaLaura"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")
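
`save_fig` writes into `IMAGES_PATH`, and saving will fail if that folder does not exist yet. The following is a minimal sketch of one way to guard against that, assuming Python 3. The helper `ensure_dir` and the `demo_*` names are hypothetical and not part of the notebook; a temporary directory stands in for the real project path.

```python
import os
import tempfile


def ensure_dir(path):
    """Create `path` (and any missing parent folders) if it does not exist yet."""
    os.makedirs(path, exist_ok=True)
    return path


# Hypothetical stand-in for IMAGES_PATH, built under a temporary directory
demo_root = tempfile.mkdtemp()
demo_images_path = ensure_dir(os.path.join(demo_root, "images", "demo_chapter"))

# Same file-name construction as save_fig: fig_id + "." + fig_extension
demo_file = os.path.join(demo_images_path, "example_plot" + "." + "png")
print(os.path.isdir(demo_images_path))
```

Calling such a helper once on `IMAGES_PATH` before the first `save_fig` call would be one option; creating the folder by hand works just as well.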

# Get the data

In [133]:
import os
import pandas as pd

DATASETS_ROOT = "C:\\Users\\user\\Desktop\\AN 3\\SEM 2\\Intelligent systems\\VaidaDiana2\\datasets"
CONFLICT_FOLDER = "pv"
CSV_FILENAME = "MergeConflictsDataset.csv"

CONFLICT_PATH = os.path.join(DATASETS_ROOT, CONFLICT_FOLDER)
CSV_PATH = os.path.join(CONFLICT_PATH, CSV_FILENAME)

def fetch_conflict_data(csv_path=CSV_PATH):
    if not os.path.exists(csv_path):
        raise FileNotFoundError(f"CSV file '{csv_path}' not found.")
    return pd.read_csv(csv_path,sep=';')
  

In [134]:
df=fetch_conflict_data()
df = df.drop(['commit','parent1','parent2','ancestor'], axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26973 entries, 0 to 26972
Data columns (total 33 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   is pr            26973 non-null  int64  
 1   added lines      26973 non-null  int64  
 2   deleted lines    26973 non-null  int64  
 3   devs parent1     26973 non-null  int64  
 4   devs parent2     26973 non-null  int64  
 5   time             26973 non-null  int64  
 6   nr files         26973 non-null  int64  
 7   added files      26973 non-null  int64  
 8   deleted files    26973 non-null  int64  
 9   renamed files    26973 non-null  int64  
 10  copied files     26973 non-null  int64  
 11  modified files   26973 non-null  int64  
 12  nr commits1      26973 non-null  int64  
 13  nr commits2      26973 non-null  int64  
 14  density1         26973 non-null  int64  
 15  density2         26973 non-null  int64  
 16  fix              26973 non-null  int64  
 17  bug         

In [135]:
import pandas as pd
#print(pd.__version__) #print pandas version, e.g., 0.24.2
#to upgrade: pip install --upgrade pandas

def load_conflict_data(conflict_path=CONFLICT_PATH):
    csv_path = os.path.join(conflict_path, "MergeConflictsDataset.csv")
    return pd.read_csv(csv_path,sep=';')

In [136]:
df = load_conflict_data()
df = df.drop(['commit','parent1','parent2','ancestor'], axis=1)
df.head()

Unnamed: 0,is pr,added lines,deleted lines,devs parent1,devs parent2,time,nr files,added files,deleted files,renamed files,...,add,remove,use,delete,change,messages_min,messages_max,messages_mean,messages_median,conflict
0,1,5,0,0,1,23,0,0,0,0,...,0,0,0,0,0,20,65,35.4,20.0,0
1,0,1166,11267,1,2,371,3,7,199,2,...,0,0,0,0,0,31,117,58.56383,53.5,1
2,1,0,0,0,1,22,0,0,0,0,...,0,0,0,0,0,18,18,18.0,18.0,0
3,1,0,0,2,1,24,1,0,0,0,...,0,0,0,0,0,22,63,38.8,31.0,0
4,0,0,0,1,2,2,1,0,0,0,...,0,0,0,0,0,31,56,43.5,43.5,1


In [137]:
# to make this notebook's output identical at every run
np.random.seed(42)

In [138]:
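from sklearn.model_selection import train_test_split

X = df.drop('conflict', axis=1)
y = df['conflict']
# Hold out 20% of the rows, stratified on the label; the other 80% is the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Split the held-out 20% again: 80% of it becomes X_valid, the remaining 20% becomes X_test
X_valid, X_test, y_valid, y_test = train_test_split(X_test, y_test, test_size=0.2, random_state=42, stratify=y_test)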
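
The two consecutive splits leave the final test set much smaller than the validation set. The arithmetic can be sketched in plain Python as below. The helper `split_sizes` is hypothetical, and rounding the held-out part up with `ceil` is an assumption about how a fractional `test_size` is turned into a row count; it does agree with the batch counts (144 and 34) printed further down in this notebook.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out part is rounded up with ceil.
    """
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


n_total = 26973                                   # rows reported by df.info()
n_train, n_holdout = split_sizes(n_total, 0.2)    # first split
n_valid, n_test = split_sizes(n_holdout, 0.2)     # second split of the held-out rows

print(n_train, n_valid, n_test)

# Cross-check against the progress bars: batches of 150 for fit,
# and batches of 32 (the Keras default) for evaluate
print(math.ceil(n_train / 150), math.ceil(n_test / 32))
```

Under these assumptions the split is roughly 80% training, 16% validation and 4% test.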

In [139]:
X_train.head()

Unnamed: 0,is pr,added lines,deleted lines,devs parent1,devs parent2,time,nr files,added files,deleted files,renamed files,...,update,add,remove,use,delete,change,messages_min,messages_max,messages_mean,messages_median
4645,0,12,12,0,1,1,0,0,0,0,...,0,0,0,0,0,0,9,9,9.0,9.0
26443,1,11,20,0,1,8,0,0,0,0,...,0,0,0,0,0,0,17,17,17.0,17.0
2508,1,18,2,0,1,28,0,0,0,0,...,0,0,0,0,0,0,35,35,35.0,35.0
1145,1,11,1,0,1,495,0,0,0,0,...,2,0,0,0,0,0,32,42,37.0,37.0
26590,1,0,1,0,1,16,0,0,0,0,...,0,0,1,0,0,0,47,47,47.0,47.0


In [140]:
%matplotlib inline
import tensorflow as tf
from tensorflow import keras
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm

print(tf.__version__)



2.16.1


In [141]:
input_shape = [X_train.shape[1]]

In [142]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=input_shape),
    layers.Dense(64, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])




In [143]:
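# Custom loss (not passed to model.compile below). It expects raw logits,
# whereas the model above ends in a sigmoid and therefore outputs probabilities.
def our_loss(y_actual, logits):
    loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=y_actual)
    loss = tf.reduce_mean(loss, axis=1)
    return loss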
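
A loss defined on logits and a loss defined on probabilities only agree when each receives the kind of input it expects. The sketch below illustrates this in plain Python for a single example. The function names are hypothetical; `bce_from_logit` uses the numerically stable form `max(x, 0) - x*z + log(1 + exp(-|x|))` that the TensorFlow documentation gives for `sigmoid_cross_entropy_with_logits`.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_from_probability(label, p):
    """Binary cross-entropy for a predicted probability p in (0, 1)."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def bce_from_logit(label, logit):
    """Numerically stable binary cross-entropy computed from a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


logit, label = 2.0, 1          # made-up values for illustration
p = sigmoid(logit)

matching = bce_from_logit(label, logit)   # logit passed to the logit-based loss
reference = bce_from_probability(label, p)
mismatched = bce_from_logit(label, p)     # probability wrongly treated as a logit

print(matching, reference, mismatched)
```

The first two values coincide, while the third differs. This suggests that a loss like `our_loss` would need a final layer without the sigmoid activation if it were ever used for training.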


In [144]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.00001),
    loss='binary_crossentropy',
    metrics=['recall'],
)
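
Recall is the fraction of actual positives that the model flags, TP / (TP + FN). It says nothing about false positives, so a recall of 1.0 alone does not mean the classifier is good. The plain-Python sketch below uses made-up labels and hypothetical helper names to illustrate the point.

```python
def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(y_true, y_pred):
    """Fraction of predicted positives that are actually positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


y_true = [1, 0, 0, 1, 0, 0, 0, 1]        # made-up labels, 3 positives out of 8
always_positive = [1] * len(y_true)      # a model that flags every row

print(recall(y_true, always_positive))     # 1.0
print(precision(y_true, always_positive))  # 0.375
```

Tracking precision or accuracy alongside recall would make a degenerate "always positive" model easier to spot.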

In [145]:
from tensorflow.keras.callbacks import Callback

class myCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        print("Checking loss at end of epoch...")
        # Stop training once the training loss drops to 0.01 or below
        logs = logs or {}
        if logs.get('loss', float('inf')) <= 0.01:
            self.model.stop_training = True
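
The stopping rule in the callback above can be checked outside Keras with a short simulation. This is only a sketch: `first_stopping_epoch` is a hypothetical helper, and `logged_losses` holds the end-of-epoch losses that are visible in the (truncated) training log of this notebook.

```python
def first_stopping_epoch(losses, threshold=0.01):
    """Return the 1-based position of the first loss at or below threshold, or None."""
    for position, loss in enumerate(losses, start=1):
        if loss <= threshold:
            return position
    return None


# End-of-epoch losses visible in the truncated training log
logged_losses = [65.4515, 64.0983, 51.2618, 3.8802, 1.4340, 2.4486,
                 0.5169, 0.6208, 0.5893, 0.4965, 0.5328, 0.4021]

print(first_stopping_epoch(logged_losses))             # None: threshold never reached
print(first_stopping_epoch([0.5, 0.05, 0.009, 0.2]))   # 3 (made-up losses)
```

None of the logged values comes close to 0.01, so in the visible part of the log the callback never triggers.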

In [None]:
loss_callback_obj = myCallback()
model.fit(
    X_train,
    y_train,
    batch_size=150,
    epochs=200,
    verbose=1,
    callbacks=[loss_callback_obj],
)

Epoch 1/200
116/144 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 64.8504 - recall: 1.0000
Checking loss at end of epoch...
144/144 ━━━━━━━━━━━━━━━━━━━━ 2s 920us/step - loss: 65.4515 - recall: 1.0000
Epoch 2/200
133/144 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 64.7067 - recall: 0.9873
Checking loss at end of epoch...
144/144 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 64.0983 - recall: 0.9856
Epoch 3/200
129/144 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 51.7106 - recall: 0.9087
Checking loss at end of epoch...
144/144 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 51.2618 - recall: 0.9070
Epoch 4/200
141/144 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 32.1004 - recall: 0.8534
Checking loss at end of epoch...
144/144 ━━━━━━━━━━━━━━━━━━━━ 0s

103/144 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 4.2108 - recall: 0.1411
Checking loss at end of epoch...
144/144 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 3.8802 - recall: 0.1450
Epoch 33/200
 96/144 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 1.0702 - recall: 0.1509
Checking loss at end of epoch...
144/144 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 1.4340 - recall: 0.1556
Epoch 34/200
105/144 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 2.4286 - recall: 0.1368
Checking loss at end of epoch...
144/144 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 2.4486 - recall: 0.1428
Epoch 35/200
115/144 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 3.4599 - recall: 0.1683
Checking loss at end of epoch...
144/144 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - l

116/144 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.5077 - recall: 0.1812
Checking loss at end of epoch...
144/144 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.5169 - recall: 0.1859
Epoch 64/200
141/144 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.6204 - recall: 0.1911
Checking loss at end of epoch...
144/144 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.6208 - recall: 0.1913
Epoch 65/200
125/144 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.5789 - recall: 0.1647
Checking loss at end of epoch...
144/144 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.5893 - recall: 0.1678
Epoch 66/200
142/144 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.6059 - recall: 0.2044
Checking loss at end of epoch...
144/144 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss:

Epoch 95/200
123/144 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.4987 - recall: 0.2578
Checking loss at end of epoch...
144/144 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.4965 - recall: 0.2551
Epoch 96/200
122/144 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.5453 - recall: 0.2450
Checking loss at end of epoch...
144/144 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.5328 - recall: 0.2433
Epoch 97/200
113/144 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.3824 - recall: 0.2302
Checking loss at end of epoch...
144/144 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.4021 - recall: 0.2297
Epoch 98/200
122/144 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.3464 - recall: 0.2368
Checking loss at end of epoch...
144/144 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms

In [116]:
model.evaluate(X_test, y_test)

34/34 ━━━━━━━━━━━━━━━━━━━━ 1s 947us/step - loss: 86.8517 - recall: 1.0000


[78.69978332519531, 1.0]