# Hands-On Machine Learning Tutorial (Part 2)

**Introduction:**
This Jupyter Notebook builds on the data analysis skills from Part 1 and introduces machine learning, deep learning, and computer vision using Python. It draws on fundamental concepts from several Kaggle courses and takes a practical, hands-on approach to these topics.


**Prerequisites:**
* Completion of Part 1 tutorial or equivalent knowledge
* Basic understanding of Python and data analysis concepts
* Jupyter Notebook environment set up
* Install required libraries: pandas, scikit-learn, matplotlib, seaborn, tensorflow, prophet, statsmodels


In [None]:
!pip install pandas scikit-learn matplotlib seaborn tensorflow prophet statsmodels keras

## 1. Machine Learning Fundamentals
Based on "Intro to Machine Learning"

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import mean_absolute_error


Load the Titanic data.
The cells below download the raw CSV and rebuild the engineered features, so no preprocessed file from Part 1 is required.

In [None]:
# Download the Titanic dataset and save it as train.csv
!wget -O train.csv "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"


In [32]:
### Read the Dataset
titanic_data = pd.read_csv("train.csv")

In [33]:
# Prepare features and target
X = titanic_data.drop(['Survived', 'Name', 'Ticket', 'Cabin'], axis=1)
y = titanic_data['Survived']

In [34]:
# Create a new feature: Family Size
titanic_data['FamilySize'] = titanic_data['SibSp'] + titanic_data['Parch'] + 1

In [35]:
# Create age groups
titanic_data['AgeGroup'] = pd.cut(titanic_data['Age'], bins=[0, 18, 35, 50, 100], labels=['Child', 'Young Adult', 'Adult', 'Senior'])
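
As a quick check of how `pd.cut` assigns these bins, the sketch below applies the same bin edges to a few made-up ages (illustrative values, not taken from the dataset). Each bin includes its right edge, so an age of exactly 18 lands in `Child`, and a missing age stays missing.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration only; NaN stands in for a missing value
ages = pd.Series([5, 18, 30, 64, np.nan])

groups = pd.cut(ages, bins=[0, 18, 35, 50, 100],
                labels=['Child', 'Young Adult', 'Adult', 'Senior'])

# Bins are right-inclusive by default: (0, 18], (18, 35], (35, 50], (50, 100]
print(groups.tolist())
```

Because missing ages produce a missing `AgeGroup`, any later step that cannot handle NaN needs those rows imputed or encoded explicitly.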

In [36]:
# Extract title from Name
# Raw string so that \. is passed to the regex engine unchanged
titanic_data['TitleCategory'] = titanic_data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

In [37]:
# Group rare titles
rare_titles = ['Lady', 'Countess','Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']
titanic_data['TitleCategory'] = titanic_data['TitleCategory'].replace(rare_titles, 'Rare')
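
The two title steps can be tried on a handful of names before running them on the full column. The sketch below assumes names follow the "Surname, Title. Given names" pattern and uses a shortened rare-title list purely for illustration.

```python
import pandas as pd

# A few names written in the "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the first run of letters that follows a space and ends with a period
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)

# Collapse uncommon titles into one bucket (shortened list for this example)
titles = titles.replace(['Countess', 'Dr', 'Rev'], 'Rare')
print(titles.tolist())
```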

In [74]:
# Select features
X = titanic_data[['Sex', 'Embarked', 'TitleCategory', 'AgeGroup', 'Pclass', 'Fare', 'Age', 'FamilySize']]


In [75]:
# Convert categorical variables using one-hot encoding
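# Only the categorical columns are encoded; Age, FamilySize, Pclass, and Fare stay numeric
X = pd.get_dummies(X, columns=['Sex', 'Embarked', 'TitleCategory', 'AgeGroup'])
# Fill missing ages with the median so the model receives no NaN values
X['Age'] = X['Age'].fillna(X['Age'].median())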
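
One-hot encoding suits columns with a handful of categories. Applied to a continuous column such as `Fare`, it would create one indicator column per distinct value, so numeric columns are usually left as numbers. The toy frame below (made-up values) shows that `pd.get_dummies` expands only the columns listed in `columns`.

```python
import pandas as pd

# Toy frame: one categorical column and one numeric column (values are made up)
toy = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                    'Fare': [7.25, 71.28, 8.05]})

# Only the listed column is expanded; Fare passes through unchanged
encoded = pd.get_dummies(toy, columns=['Sex'])
print(encoded.columns.tolist())
```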

In [76]:
# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(X_train, y_train)

In [78]:
# Make predictions
preds = rf_model.predict(X_val)

In [None]:
# Evaluate
print(f"Accuracy: {accuracy_score(y_val, preds):.2f}")
print("Confusion Matrix:")
print(confusion_matrix(y_val, preds))
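
To read the confusion matrix, recall that scikit-learn puts true labels on the rows and predicted labels on the columns, so a binary matrix is laid out as `[[TN, FP], [FN, TP]]`. The sketch below uses made-up labels (not the Titanic predictions) to derive precision and recall from those four counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = survived, 0 = did not survive
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted survivors, how many survived
recall = tp / (tp + fn)      # of the actual survivors, how many were found
print(tn, fp, fn, tp, precision, recall)
```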

In [None]:
import matplotlib.pyplot as plt

# Feature Importance
feat_importances = pd.Series(rf_model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.title('Top 10 Feature Importances')
plt.show()
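
As a sanity check on how to read `feature_importances_`, the sketch below trains a small forest on synthetic data in which only one column (hypothetically named `signal`) determines the label. The impurity-based importances sum to 1, and the informative column should dominate.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data: the label depends only on the 'signal' column
data = pd.DataFrame({'signal': rng.normal(size=300),
                     'noise_a': rng.normal(size=300),
                     'noise_b': rng.normal(size=300)})
target = (data['signal'] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(data, target)
importances = pd.Series(forest.feature_importances_, index=data.columns)

print(importances.sort_values(ascending=False))
print(importances.sum())   # impurity-based importances are normalized
```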

## 2. Deep Learning Basics
Based on "Intro to Deep Learning"

In [102]:
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np
import matplotlib.pyplot as plt
import keras
from tensorflow.keras.utils import to_categorical
#from keras.utils import np_utils

In [16]:
# Load CIFAR10 data
(X_train_dl, y_train_dl), (X_test_dl, y_test_dl) = tf.keras.datasets.cifar10.load_data()

In [None]:
# Checking the array shapes: images are (samples, height, width, channels), labels are (samples, 1)
print(X_train_dl.shape)
print(y_train_dl.shape)
print(X_test_dl.shape)
print(y_test_dl.shape)

In [None]:
# Checking the number of unique classes
print(np.unique(y_train_dl))
print(np.unique(y_test_dl))

In [None]:
# Show some random training images
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

plt.figure(figsize=(10, 10))
for i in range(5):
    plt.subplot(1, 5, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(X_train_dl[i], cmap=plt.cm.binary)
    plt.xlabel(class_names[y_train_dl[i][0]])
plt.show()

In [7]:
# Data Preprocess

# Convert the pixel data to float32 and scale it to the [0, 1] range
# (255 is the maximum intensity of an 8-bit pixel); CIFAR-10 images keep their (32, 32, 3) shape
X_train_dl = X_train_dl.astype('float32') / 255.0
X_test_dl = X_test_dl.astype('float32') / 255.0
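
CIFAR-10 images are 32x32 RGB arrays of 8-bit values, so dividing by 255 maps every pixel into [0, 1] without touching the array shape. The sketch below demonstrates this on random stand-in data rather than the real dataset.

```python
import numpy as np

# Stand-in for a batch of CIFAR-10 images: 4 RGB images of 32x32 pixels
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)

# Cast first, then divide, so the result is float32
scaled = fake_images.astype('float32') / 255.0

print(scaled.shape, scaled.dtype)
print(scaled.min(), scaled.max())
```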

In [None]:
# One hot encoding the target class (labels)
num_classes = 10
train_labels = to_categorical(y_train_dl, num_classes)
test_labels = to_categorical(y_test_dl, num_classes)
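
For intuition, one-hot encoding can be reproduced with plain NumPy. This is a sketch of the transformation, not Keras's own implementation of `to_categorical`; the three labels are made up.

```python
import numpy as np

# Labels shaped (n, 1), the way the CIFAR-10 loader returns them
labels = np.array([[3], [0], [9]])
num_classes = 10

# Row k of the identity matrix is the one-hot vector for class k
one_hot = np.eye(num_classes, dtype='float32')[labels.ravel()]

print(one_hot.shape)
print(one_hot.argmax(axis=1))   # argmax recovers the integer labels
```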

In [None]:
# Build CNN model
model = tf.keras.Sequential([
    layers.Conv2D(32, (3,3), activation='relu', input_shape=(32,32,3)),
    layers.MaxPooling2D((2,2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax')
])
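
The layer sizes of this small CNN can be worked out by hand, which helps when reading `model.summary()`. The arithmetic below assumes the Keras defaults for the layers above (stride 1 and `'valid'` padding for `Conv2D`, non-overlapping 2x2 pooling).

```python
# Shape and parameter arithmetic for the small CNN (no TensorFlow needed)
height = width = 32
channels = 3

# Conv2D(32, (3, 3)) with 'valid' padding trims kernel_size - 1 pixels per dimension
conv_size = height - 3 + 1                      # 30
conv_params = (3 * 3 * channels + 1) * 32       # weights plus one bias per filter

# MaxPooling2D((2, 2)) halves each spatial dimension
pool_size = conv_size // 2                      # 15

flat_units = pool_size * pool_size * 32         # 7200 values after Flatten
dense1_params = (flat_units + 1) * 128          # weights plus biases
dense2_params = (128 + 1) * 10

total = conv_params + dense1_params + dense2_params
print(conv_size, pool_size, flat_units, total)
```

Almost all of the parameters sit in the first `Dense` layer, because `Flatten` hands it 7,200 inputs.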

In [104]:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

In [None]:
# Train
history = model.fit(X_train_dl, y_train_dl,
                   epochs=5,
                   validation_data=(X_test_dl, y_test_dl))

In [None]:
# Plot training history
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.title('Training History')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()


In [None]:
# Plot training history
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.title('Training History')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()

In [127]:
import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras import datasets, layers, models
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.utils import to_categorical

In [129]:
# One hot encoding the target class (labels)
num_classes = 10
y_train_dl = to_categorical(y_train_dl, num_classes)
y_test_dl = to_categorical(y_test_dl, num_classes)

In [None]:
# Creating a sequential model and adding layers to it

model2 = Sequential()

model2.add(layers.Conv2D(32, (3,3), padding='same', activation='relu', input_shape=(32,32,3)))
model2.add(layers.BatchNormalization())
model2.add(layers.Conv2D(32, (3,3), padding='same', activation='relu'))
model2.add(layers.BatchNormalization())
model2.add(layers.MaxPooling2D(pool_size=(2,2)))
model2.add(layers.Dropout(0.3))

model2.add(layers.Conv2D(64, (3,3), padding='same', activation='relu'))
model2.add(layers.BatchNormalization())
model2.add(layers.Conv2D(64, (3,3), padding='same', activation='relu'))
model2.add(layers.BatchNormalization())
model2.add(layers.MaxPooling2D(pool_size=(2,2)))
model2.add(layers.Dropout(0.5))

model2.add(layers.Conv2D(128, (3,3), padding='same', activation='relu'))
model2.add(layers.BatchNormalization())
model2.add(layers.Conv2D(128, (3,3), padding='same', activation='relu'))
model2.add(layers.BatchNormalization())
model2.add(layers.MaxPooling2D(pool_size=(2,2)))
model2.add(layers.Dropout(0.5))

model2.add(layers.Flatten())
model2.add(layers.Dense(128, activation='relu'))
model2.add(layers.BatchNormalization())
model2.add(layers.Dropout(0.5))
model2.add(layers.Dense(num_classes, activation='softmax'))    # num_classes = 10

# Checking the model summary
model2.summary()
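
The summary can be cross-checked by tracing the feature-map shape through the three convolution blocks. The sketch assumes stride-1 convolutions with `padding='same'`, which leave height and width unchanged, so only the pooling layers shrink the image.

```python
# Trace the feature-map shape through the three convolution blocks
size, filters = 32, 3
for block_filters in (32, 64, 128):
    # padding='same' with stride 1 keeps height and width; only the depth changes
    filters = block_filters
    # MaxPooling2D(pool_size=(2, 2)) halves height and width
    size //= 2
    print(f"after block with {filters} filters: {size}x{size}x{filters}")

# Number of values handed to the first Dense layer after Flatten
flat_units = size * size * filters
print(flat_units)
```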

In [131]:
model2.compile(optimizer='adam', loss=keras.losses.categorical_crossentropy, metrics=['accuracy'])
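
The first model was compiled with `sparse_categorical_crossentropy` and integer labels, while this one uses `categorical_crossentropy` and one-hot labels. Both express the same quantity, the negative log of the probability assigned to the true class. The NumPy sketch below checks this on made-up probabilities; it mirrors the formula only and is not how Keras computes the loss internally.

```python
import numpy as np

# Made-up softmax outputs for 2 samples over 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
int_labels = np.array([0, 2])
one_hot = np.eye(3)[int_labels]

# Sparse form: pick the predicted probability of the true class by index
sparse_loss = -np.log(probs[np.arange(len(int_labels)), int_labels])

# Categorical form: the one-hot vector zeroes out every other class
categorical_loss = -(one_hot * np.log(probs)).sum(axis=1)

print(sparse_loss, categorical_loss)
```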

In [None]:
# Train
history2 = model2.fit(X_train_dl, y_train_dl,
                   epochs=5,
                   validation_data=(X_test_dl, y_test_dl))

In [None]:
# Plot training history
plt.plot(history2.history['loss'], label='train')
plt.plot(history2.history['val_loss'], label='validation')
plt.title('Training History')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()

In [None]:
# Plot training history
plt.plot(history2.history['accuracy'], label='train')
plt.plot(history2.history['val_accuracy'], label='validation')
plt.title('Training History')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()

In [None]:
# Making the predictions with the second (deeper) model trained above
pred = model2.predict(X_test_dl)
print(pred)

# Converting the predictions into label index
pred_classes = np.argmax(pred, axis=1)
print(pred_classes)
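
Each row of the prediction array holds one probability per class, and `np.argmax` picks the most probable class index. The sketch below uses made-up probabilities to show the conversion, along with the same trick for turning one-hot labels back into indices for comparison.

```python
import numpy as np

# Made-up probability rows standing in for model output (3 samples, 4 classes)
probs = np.array([[0.1, 0.6, 0.2, 0.1],
                  [0.5, 0.2, 0.2, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
true_one_hot = np.eye(4)[[1, 2, 3]]

pred_idx = np.argmax(probs, axis=1)          # most probable class per row
true_idx = np.argmax(true_one_hot, axis=1)   # undo the one-hot encoding

accuracy = np.mean(pred_idx == true_idx)
print(pred_idx, accuracy)
```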

In [None]:
# Plotting the Actual vs. Predicted results

fig, axes = plt.subplots(5, 5, figsize=(15,15))
axes = axes.ravel()

for i in np.arange(0, 25):
    axes[i].imshow(X_test_dl[i])
    axes[i].set_title("True: %s \nPredict: %s" % (class_names[np.argmax(y_test_dl[i])], class_names[pred_classes[i]]))
    axes[i].axis('off')
# Adjust the spacing once, after all panels are drawn
plt.subplots_adjust(wspace=1)
plt.show()

## 3. Computer Vision: Data Augmentation
Based on "Computer Vision" course

In [107]:
# Image augmentation example
from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [108]:
datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    zoom_range=0.2)
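
To make one of these settings concrete: `width_shift_range=0.2` allows horizontal shifts of up to 20% of the image width, about 6 pixels for a 32-pixel-wide image. The NumPy sketch below is a simplified illustration with a hypothetical helper, `shift_horizontally`. It moves the image by whole pixels and repeats the edge column to fill the gap, whereas the real generator applies its own transformation and interpolation logic.

```python
import numpy as np

def shift_horizontally(image, dx):
    """Shift an HxWxC image right by dx pixels (left if negative), repeating edge pixels."""
    width = image.shape[1]
    pad = abs(dx)
    # Pad both sides with copies of the edge columns, then crop a shifted window
    padded = np.pad(image, ((0, 0), (pad, pad), (0, 0)), mode='edge')
    start = pad - dx
    return padded[:, start:start + width, :]

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3)).astype('float32')   # random stand-in image

# Draw a whole-pixel shift within 20% of the image width
max_shift = int(0.2 * image.shape[1])
dx = int(rng.integers(-max_shift, max_shift + 1))
shifted = shift_horizontally(image, dx)
print(dx, shifted.shape)
```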

In [None]:
# Example image from the training set
img_to_augment = X_train_dl[0]

# Generate augmented images
augmented_images = datagen.flow(np.expand_dims(img_to_augment, axis=0), batch_size=1)

# Visualize the original image and some augmented versions
plt.figure(figsize=(10, 5))
plt.subplot(1, 5, 1)
plt.imshow(img_to_augment)
plt.title('Original Image')
plt.axis('off')

for i in range(4):
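    # Pixels were scaled to [0, 1] earlier, so keep them as floats (casting to uint8 would turn them black)
    augmented_img = np.clip(next(augmented_images)[0], 0.0, 1.0)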
    plt.subplot(1, 5, i + 2)
    plt.imshow(augmented_img)
    plt.title(f'Augmented Image {i+1}')
    plt.axis('off')

plt.tight_layout()
plt.show()

## Conclusion
In this advanced tutorial, we have covered:
1. Machine Learning fundamentals with Random Forest
2. Introduction to Deep Learning with CNNs
3. Basic Computer Vision concepts


This notebook has provided a practical introduction to various advanced data science topics.
To deepen your understanding, consider:
- Experimenting with different models and hyperparameters
- Exploring more advanced time series techniques
- Diving deeper into deep learning architectures
- Applying these concepts to different datasets and problem domains


Remember, the field of data science and machine learning is vast and rapidly evolving.
Continuous learning and practice are key to mastering these skills.

## References
* [Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)
* [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)
* [Computer Vision](https://www.kaggle.com/learn/computer-vision)
* [Intro to Deep Learning](https://www.kaggle.com/learn/intro-to-deep-learning)
* [Time Series](https://www.kaggle.com/learn/time-series)
* [Titanic Dataset](https://www.kaggle.com/c/titanic/data)
* [Pandas](https://www.kaggle.com/learn/pandas)
* [Data Cleaning](https://www.kaggle.com/learn/data-cleaning)
* [Data Visualization](https://www.kaggle.com/learn/data-visualization)
* [Feature Engineering](https://www.kaggle.com/learn/feature-engineering)
