# Project Template

A guide to applying machine learning to a dataset.

## Step 1: Prepare Project

1. Load libraries
2. Load dataset

In [1]:
# Load libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

# Load dataset
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

# Create a DataFrame for analysis
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Get dimensions
print("=== Dataset Dimensions ===")
print(f"Shape of data (samples, features): {data.data.shape}")
print(f"Number of samples (patients): {data.data.shape[0]}")
print(f"Number of features: {data.data.shape[1]}")
print(f"Shape of target variable: {data.target.shape}")
print(f"Feature names: {len(data.feature_names)}")
print(f"Target names: {data.target_names}")
print(f"\nFeature names list:")
for i, feature in enumerate(data.feature_names):
    print(f"  {i+1:2d}. {feature}")

=== Dataset Dimensions ===
Shape of data (samples, features): (569, 30)
Number of samples (patients): 569
Number of features: 30
Shape of target variable: (569,)
Feature names: 30
Target names: ['malignant' 'benign']

Feature names list:
   1. mean radius
   2. mean texture
   3. mean perimeter
   4. mean area
   5. mean smoothness
   6. mean compactness
   7. mean concavity
   8. mean concave points
   9. mean symmetry
  10. mean fractal dimension
  11. radius error
  12. texture error
  13. perimeter error
  14. area error
  15. smoothness error
  16. compactness error
  17. concavity error
  18. concave points error
  19. symmetry error
  20. fractal dimension error
  21. worst radius
  22. worst texture
  23. worst perimeter
  24. worst area
  25. worst smoothness
  26. worst compactness
  27. worst concavity
  28. worst concave points
  29. worst symmetry
  30. worst fractal dimension


### A peek at the data

In [2]:
print("="*60)
print("PEEK AT BREAST CANCER DATA")
print("="*60)

# 1. View first 20 rows
print("\n1. First 20 rows:")
print(df.head(20))

# 2. Dimensions
print("\n\n2. Dimensions of the data:")
print(f"Shape: {df.shape}")
print(f"Rows (patients): {df.shape[0]}")
print(f"Columns: {df.shape[1]}")

# 3. Data types
print("\n\n3. Data Type of Each Attribute:")
print(df.dtypes)

# 4. Check target variable
print("\n\n4. Target variable check:")
print("First few rows with target:")
print(df[['mean radius', 'mean texture', 'target']].head())
print("\nLast few rows with target:")
print(df[['mean radius', 'mean texture', 'target']].tail())

# 5. Basic statistics
print("\n\n5. Basic statistics:")
print(df.describe())

# 6. Target distribution
print("\n\n6. Target distribution:")
print(df['target'].value_counts())
print("\nTarget meaning:")
print(f"  0 = {data.target_names[0]}")
print(f"  1 = {data.target_names[1]}")

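============================================================
PEEK AT BREAST CANCER DATA
============================================================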

1. First 20 rows:
    mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0         17.99         10.38          122.80     1001.0          0.11840   
1         20.57         17.77          132.90     1326.0          0.08474   
2         19.69         21.25          130.00     1203.0          0.10960   
3         11.42         20.38           77.58      386.1          0.14250   
4         20.29         14.34          135.10     1297.0          0.10030   
5         12.45         15.70           82.57      477.1          0.12780   
6         18.25         19.98          119.60     1040.0          0.09463   
7         13.71         20.83           90.20      577.9          0.11890   
8         13.00         21.82           87.50      519.8          0.12730   
9         12.46         24.04           83.97      475.9          0.11860   
10        16.02         23.24          102.70      797.8          0.08206   
11        15.78         17.89 

### Statistical summary of all attributes

In [3]:
print("Statistical Summary of All Attributes:")
print("-"*50)
print(df.describe())

Statistical Summary of All Attributes:
--------------------------------------------------
       mean radius  mean texture  mean perimeter    mean area  \
count   569.000000    569.000000      569.000000   569.000000   
mean     14.127292     19.289649       91.969033   654.889104   
std       3.524049      4.301036       24.298981   351.914129   
min       6.981000      9.710000       43.790000   143.500000   
25%      11.700000     16.170000       75.170000   420.300000   
50%      13.370000     18.840000       86.240000   551.100000   
75%      15.780000     21.800000      104.100000   782.700000   
max      28.110000     39.280000      188.500000  2501.000000   

       mean smoothness  mean compactness  mean concavity  mean concave points  \
count       569.000000        569.000000      569.000000           569.000000   
mean          0.096360          0.104341        0.088799             0.048919   
std           0.014064          0.052813        0.079720             0.038803   


In [4]:
# Basic stats only
print(df.describe().loc[['mean', 'std', 'min', 'max']])

      mean radius  mean texture  mean perimeter    mean area  mean smoothness  \
mean    14.127292     19.289649       91.969033   654.889104         0.096360   
std      3.524049      4.301036       24.298981   351.914129         0.014064   
min      6.981000      9.710000       43.790000   143.500000         0.052630   
max     28.110000     39.280000      188.500000  2501.000000         0.163400   

      mean compactness  mean concavity  mean concave points  mean symmetry  \
mean          0.104341        0.088799             0.048919       0.181162   
std           0.052813        0.079720             0.038803       0.027414   
min           0.019380        0.000000             0.000000       0.106000   
max           0.345400        0.426800             0.201200       0.304000   

      mean fractal dimension  ...  worst texture  worst perimeter  \
mean                0.062798  ...      25.677223       107.261213   
std                 0.007060  ...       6.146258        33.602542

In [5]:
# Transposed - easier to read
print(df.describe().T)

                         count        mean         std         min  \
mean radius              569.0   14.127292    3.524049    6.981000   
mean texture             569.0   19.289649    4.301036    9.710000   
mean perimeter           569.0   91.969033   24.298981   43.790000   
mean area                569.0  654.889104  351.914129  143.500000   
mean smoothness          569.0    0.096360    0.014064    0.052630   
mean compactness         569.0    0.104341    0.052813    0.019380   
mean concavity           569.0    0.088799    0.079720    0.000000   
mean concave points      569.0    0.048919    0.038803    0.000000   
mean symmetry            569.0    0.181162    0.027414    0.106000   
mean fractal dimension   569.0    0.062798    0.007060    0.049960   
radius error             569.0    0.405172    0.277313    0.111500   
texture error            569.0    1.216853    0.551648    0.360200   
perimeter error          569.0    2.866059    2.021855    0.757000   
area error          

## Step 2: Define Problem
What is your task? What are your goals? What do you want to achieve?
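For the breast cancer dataset loaded in Step 1, one way to make these questions concrete is to treat the task as binary classification (malignant vs. benign) and to record the accuracy of always predicting the majority class as a reference point. The sketch below assumes accuracy is the metric of interest; the names `class_counts` and `majority_baseline` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target  # features and binary target (0 = malignant, 1 = benign)

# Count samples per class and map the counts to the class names
counts = np.bincount(y)
class_counts = {str(name): int(n) for name, n in zip(data.target_names, counts)}

# Accuracy obtained by always predicting the most frequent class
majority_baseline = counts.max() / counts.sum()

print(class_counts)
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any model selected later should beat this baseline to be considered useful.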

## Step 3: Exploratory Analysis
Understand your data: take a peek at it and answer basic questions about the dataset.
Summarise your data using descriptive statistics and visualisations.
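As a minimal sketch of this step, the snippet below compares class-wise means, ranks features by their correlation with the target, and draws one histogram per class. The choice of columns (`mean radius`, `mean texture`, `mean area`) is an assumption made for illustration, not a recommendation.

```python
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Descriptive statistics: mean of a few features per class (0 = malignant, 1 = benign)
class_means = df.groupby("target")[["mean radius", "mean texture", "mean area"]].mean()
print(class_means)

# Pearson correlation of every feature with the target, most negative first
target_corr = df.corr()["target"].drop("target").sort_values()
print(target_corr.head(5))

# Visualisation: distribution of one feature, split by class
fig, ax = plt.subplots(figsize=(6, 4))
for label, name in enumerate(data.target_names):
    ax.hist(df.loc[df["target"] == label, "mean radius"], bins=30, alpha=0.6, label=str(name))
ax.set_xlabel("mean radius")
ax.set_ylabel("count")
ax.legend()
fig.tight_layout()
```

In a notebook the figure renders below the cell; in a script, call `plt.show()` or `fig.savefig(...)`.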

## Step 4: Prepare Data
Clean and wrangle the data, and collect more data if necessary.
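A basic cleaning pass might look like the following sketch: count missing values and duplicate rows, drop them, and hold out a stratified test split. The 80/20 split and `random_state=42` are arbitrary assumptions.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Data-quality checks
missing_total = int(df.isnull().sum().sum())
duplicate_rows = int(df.duplicated().sum())
print(f"Missing values: {missing_total}, duplicate rows: {duplicate_rows}")

# Drop duplicates and rows with missing values (a no-op if there are none)
df_clean = df.drop_duplicates().dropna()

# Hold out a stratified test set so both classes keep their proportions
X = df_clean.drop(columns="target")
y = df_clean["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```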

## Step 5: Feature Engineering
Feature selection, feature engineering (creating new features), and data transformations.
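The three activities can be sketched together as follows. The ratio feature `area_to_perimeter` is a hypothetical example of a new feature, and `k=10` is an arbitrary choice; neither comes from the original template.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Feature engineering: a hypothetical new feature derived from existing columns
X["area_to_perimeter"] = X["mean area"] / X["mean perimeter"]

# Data transformation (standardisation) followed by univariate feature selection
prep = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
])
X_selected = prep.fit_transform(X, y)

# Names of the columns kept by the selector
selected_names = X.columns[prep.named_steps["select"].get_support()]
print(X_selected.shape)
print(list(selected_names))
```

For simplicity the sketch fits on the full dataset. In a real workflow, fit these steps on the training data only (for example by placing them in the same pipeline as the model) to avoid leaking information from the test set.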

## Step 6: Algorithm Selection
Select a set of algorithms to apply, select evaluation metrics, and evaluate/compare algorithms.
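One possible way to compare candidate algorithms is k-fold cross-validation with a single metric. The sketch below assumes three candidates (logistic regression, k-nearest neighbours, a decision tree), accuracy as the metric, and 5 stratified folds; all of these are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate algorithms; scale-sensitive models are wrapped with a scaler
models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "CART": DecisionTreeClassifier(random_state=42),
}

# Use the same folds for every model so the comparison is fair
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Swapping `scoring="accuracy"` for `"recall"` or `"roc_auc"` changes the evaluation metric without touching the rest of the loop.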

## Step 7: Model Training
Apply ensembles and improve performance by hyperparameter optimisation.
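As a sketch of this step, the snippet below tunes a random forest (one example of an ensemble) with a small grid search. The parameter grid, the 5-fold setting, and the 80/20 split are assumptions kept deliberately small so the example runs quickly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Small illustrative grid of hyperparameters to try
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
}

# Cross-validated grid search over the ensemble's hyperparameters,
# using the training split only
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```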

## Step 8: Finalise Model
Make predictions on the validation set and create the final model from the entire training dataset.
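A minimal sketch of finalising a model, assuming a scaled logistic regression was the chosen algorithm (an illustrative assumption, not a result from this notebook): fit on the entire training split, then evaluate once on the held-out validation split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_val, y_train, y_val = train_test_split(
    data.data, data.target, test_size=0.2, stratify=data.target, random_state=42
)

# Create the final model from the entire training split
final_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
final_model.fit(X_train, y_train)

# Predict on the validation split, which the model has not seen
y_pred = final_model.predict(X_val)
val_accuracy = accuracy_score(y_val, y_pred)

print(f"Validation accuracy: {val_accuracy:.3f}")
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred, target_names=data.target_names))
```

For a medical dataset like this one, the per-class recall in the classification report is often more informative than overall accuracy.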