# SUPERVISED LEARNING
# Project Description
* The data are Monte Carlo (MC) generated to simulate the registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. A Cherenkov gamma telescope observes high-energy gamma rays by taking advantage of the radiation emitted by charged particles produced inside the electromagnetic showers that the gammas initiate and that develop in the atmosphere. This Cherenkov radiation (at visible to UV wavelengths) leaks through the atmosphere and is recorded in the detector, allowing reconstruction of the shower parameters. The available information consists of pulses left by the incoming Cherenkov photons on the photomultiplier tubes, which are arranged in a plane, the camera. Depending on the energy of the primary gamma, a total of a few hundred to some 10,000 Cherenkov photons are collected in patterns (called the shower image), which makes it possible to statistically discriminate images caused by primary gammas (signal) from images of hadronic showers initiated by cosmic rays in the upper atmosphere (background).

* Typically, the image of a shower after some pre-processing is an elongated cluster. Its long axis is oriented towards the camera center if the shower axis is parallel to the telescope's optical axis, i.e. if the telescope axis is directed towards a point source. A principal component analysis is performed in the camera plane, which results in a correlation axis and defines an ellipse. If the depositions were distributed as a bivariate Gaussian, this would be an equidensity ellipse. The characteristic parameters of this ellipse (often called Hillas parameters) are among the image parameters that can be used for discrimination. The energy depositions are typically asymmetric along the major axis, and this asymmetry can also be used in discrimination. There are, in addition, further discriminating characteristics, like the extent of the cluster in the image plane, or the total sum of depositions.
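
The ellipse construction described above can be illustrated with a small sketch. It is not the telescope's actual reconstruction code: the function name `ellipse_from_image`, the toy pixel cloud, and the choice to report the standard deviation along each principal axis as "length" and "width" are all assumptions made for illustration.

```python
import numpy as np

def ellipse_from_image(xy, weights):
    """Weighted PCA of pixel positions; returns (length, width, angle_deg).

    Assumption: 'length' and 'width' are taken as the standard deviations
    along the major and minor principal axes.
    """
    w = weights / weights.sum()
    center = (xy * w[:, None]).sum(axis=0)      # weighted centroid
    d = xy - center
    cov = (d * w[:, None]).T @ d                # 2x2 weighted covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    width, length = np.sqrt(eigvals)            # minor axis first, major axis second
    major = eigvecs[:, 1]                       # direction of the major axis
    angle = np.degrees(np.arctan2(major[1], major[0])) % 180.0
    return length, width, angle

# Toy "shower image": an elongated Gaussian cluster with its long axis along x
rng = np.random.default_rng(0)
xy = rng.normal(size=(5000, 2)) * np.array([30.0, 10.0])
weights = np.ones(len(xy))                      # equal deposition per pixel (assumed)

length, width, angle = ellipse_from_image(xy, weights)
print(f"length={length:.1f}, width={width:.1f}, angle={angle:.1f} deg")
```

For a bivariate Gaussian cloud like this one, the recovered axes line up with the equidensity ellipse mentioned in the text.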


In [57]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler

# Feature and Target Description

## Variable Details

| Variable Name | Role    | Type       | Description                                                      | Units | Missing Values |
|---------------|---------|------------|------------------------------------------------------------------|-------|----------------|
| fLength       | Feature | Continuous | major axis of ellipse                                            | mm    | no             |
| fWidth        | Feature | Continuous | minor axis of ellipse                                            | mm    | no             |
| fSize         | Feature | Continuous | 10-log of sum of content of all pixels                           | #phot | no             |
| fConc         | Feature | Continuous | ratio of sum of two highest pixels over fSize                    |       | no             |
| fConc1        | Feature | Continuous | ratio of highest pixel over fSize                                |       | no             |
| fAsym         | Feature | Continuous | distance from highest pixel to center, projected onto major axis | mm    | no             |
| fM3Long       | Feature | Continuous | 3rd root of third moment along major axis                        | mm    | no             |
| fM3Trans      | Feature | Continuous | 3rd root of third moment along minor axis                        | mm    | no             |
| fAlpha        | Feature | Continuous | angle of major axis with vector to origin                        | deg   | no             |
| fDist         | Feature | Continuous | distance from origin to center of ellipse                        | mm    | no             |
| class         | Target  | Binary     | gamma (signal), hadron (background)                              |       | no             |



In [17]:
# Column names for the dataset (the raw file has no header row)
cols = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]
dataset = pd.read_csv("magic04.data", names=cols)  # names=cols assigns these titles to the columns
dataset.head()

Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,class
0,28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.011,-8.2027,40.092,81.8828,g
1,31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.261,g
2,162.052,136.031,4.0612,0.0374,0.0187,116.741,-64.858,-45.216,76.96,256.788,g
3,23.8172,9.5728,2.3385,0.6147,0.3922,27.2107,-6.4633,-7.1513,10.449,116.737,g
4,75.1362,30.9205,3.1611,0.3168,0.1832,-5.5277,28.5525,21.8393,4.648,356.462,g


In [19]:
dataset["class"].unique()
# Encode the target as integers: "g" (gamma) -> 1, "h" (hadron) -> 0
dataset["class"] = (dataset["class"] == "g").astype(int)
dataset.head()

Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,class
0,28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.011,-8.2027,40.092,81.8828,1
1,31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.261,1
2,162.052,136.031,4.0612,0.0374,0.0187,116.741,-64.858,-45.216,76.96,256.788,1
3,23.8172,9.5728,2.3385,0.6147,0.3922,27.2107,-6.4633,-7.1513,10.449,116.737,1
4,75.1362,30.9205,3.1611,0.3168,0.1832,-5.5277,28.5525,21.8393,4.648,356.462,1


In [None]:
for label in cols[:-1]:
    # Overlay normalized histograms of each feature for the two classes to compare their distributions
    # Class 1 (gamma) is drawn in blue
    plt.hist(dataset[dataset["class"] == 1][label], color='blue', label='gamma', alpha=0.7, density=True)
    # Class 0 (hadron) is drawn in red
    plt.hist(dataset[dataset["class"] == 0][label], color='red', label='hadron', alpha=0.7, density=True)
    plt.title(label)
    plt.xlabel(label)
    plt.ylabel("Probability density")
    plt.legend()
    plt.show()

In [None]:
dataset.isnull().sum()  # Count the missing values in each column

In [34]:
# dataset.drop_duplicates()            # returns a copy with duplicate rows removed
# print(dataset.duplicated().sum())    # prints the number of duplicate rows in the dataset

# Model Training, Validation, and Testing

In the process of building and evaluating a machine learning model, the dataset is typically divided into three distinct subsets: **Training**, **Validation**, and **Testing**. These subsets serve different purposes to ensure the model is both accurate and generalizes well to unseen data. Below is an explanation of each subset and its purpose:

## 1. **Training Data**
- **Purpose**: The training data is used to **train** the machine learning model. During this phase, the model learns the underlying patterns, relationships, and features in the dataset.
- **How it's used**: The model is optimized using the training data through methods like gradient descent. The model’s weights and biases are updated to minimize the loss function during this phase.
- **Proportion of dataset**: The training set typically represents the largest portion of the data, around 70-80% of the total dataset.

## 2. **Validation Data**
- **Purpose**: The validation data is used to tune the model’s **hyperparameters** and evaluate its performance **during** training. It helps in preventing overfitting by providing an unbiased evaluation while the model is still being trained.
- **How it's used**: After each training epoch, the model is evaluated on the validation set. This helps in adjusting hyperparameters like learning rate, regularization, and deciding when to stop training (early stopping).
- **Proportion of dataset**: The validation set typically represents about 10-15% of the total dataset.

## 3. **Test Data**
- **Purpose**: The test data is used to evaluate the **final performance** of the model after it has been trained and validated. This dataset helps assess how well the model generalizes to unseen data.
- **How it's used**: After the model has completed training and tuning, it is tested on this set to measure its performance in terms of accuracy, precision, recall, F1 score, or other relevant metrics.
- **Proportion of dataset**: The test set typically makes up about 10-15% of the total dataset.

## 4. **Dataset Splitting**
To ensure the model is trained and evaluated properly, we split the dataset into three subsets: Training, Validation, and Test sets.

### Example of Splitting the Dataset:
```python
from sklearn.model_selection import train_test_split

# Assuming 'data' is a pandas DataFrame containing the dataset
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Further split the test data into validation and test sets
val_data, test_data = train_test_split(test_data, test_size=0.5, random_state=42)

# Now we have:
# train_data: 80% of the original data
# val_data: 10% of the original data
# test_data: 10% of the original data
```

In [75]:
# Shuffle the rows, then cut at 60% and 80% to get a 60/20/20 train/validation/test split
train, validate, test = np.split(dataset.sample(frac=1), [int(0.6 * len(dataset)), int(0.8 * len(dataset))])
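
A shuffle-and-cut split does not guarantee that each subset keeps the same gamma/hadron ratio. As a hedged alternative, the sketch below produces a 60/20/20 split with `train_test_split` and its `stratify` argument. The `toy` DataFrame and the names `train_part`, `validate_part`, and `test_part` are made up for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset: one feature and an imbalanced binary target
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "class": (rng.random(1000) < 0.65).astype(int),
})

# First cut: 60% for training, 40% held out, preserving the class ratio
train_part, rest = train_test_split(
    toy, train_size=0.6, stratify=toy["class"], random_state=0
)
# Second cut: split the held-out 40% evenly into validation and test
validate_part, test_part = train_test_split(
    rest, test_size=0.5, stratify=rest["class"], random_state=0
)

for name, part in [("train", train_part), ("validate", validate_part), ("test", test_part)]:
    print(name, len(part), round(part["class"].mean(), 3))
```

Fixing `random_state` also makes the split reproducible between runs, which `dataset.sample(frac=1)` without a seed does not.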



# Scale Dataset Function

This section defines a Python function `scale_dataset` that standardizes the features of a dataset and, optionally, oversamples the minority class before the data is used in machine learning models.

---

## Code Explanation

### 1. Input Data Splitting
```python
x = dataframe[dataframe.columns[:-1]].values
y = dataframe[dataframe.columns[-1]].values
```
* `x`: The independent variables (features), extracted from all columns except the last one in the DataFrame.
* `y`: The dependent variable (target), extracted from the last column in the DataFrame.

### 2. Feature Standardization
```python
scaler = StandardScaler()
x = scaler.fit_transform(x)
```
* `scaler`: An instance of `StandardScaler` from `sklearn` used to standardize features.
* `x`: The standardized features, transformed so that each column has a mean of 0 and a standard deviation of 1.

### 3. Combining Features and Target
```python
data = np.hstack((x, np.reshape(y, (-1, 1))))
```
* `data`: The combined dataset, with the standardized features and the target variable side by side.
* `np.reshape(y, (-1, 1))`: Reshapes the target variable into a column vector so it can be stacked next to `x`.
* `np.hstack`: Horizontally stacks the standardized features (`x`) with the reshaped target (`y`).


In [59]:
def scale_dataset(dataframe, oversample=False):
    # Features are all columns except the last; the target is the last column
    x = dataframe[dataframe.columns[:-1]].values
    y = dataframe[dataframe.columns[-1]].values

    # Standardize each feature to zero mean and unit variance
    scaler = StandardScaler()
    x = scaler.fit_transform(x)

    # Optionally resample the minority class until both classes are equally frequent
    if oversample:
        ros = RandomOverSampler()
        x, y = ros.fit_resample(x, y)

    # Stack features and target back into a single 2-D array
    data = np.hstack((x, np.reshape(y, (-1, 1))))

    return data, x, y

In [69]:
print(len(train[train["class"] == 1]))  # gamma events
print(len(train[train["class"] == 0]))  # hadron events

# The training set is imbalanced: gamma events outnumber hadron events (7407 vs. 3936 here),
# which can bias the model toward the majority class, so we oversample the minority class.

7407
3936


In [76]:
train, x_train, y_train = scale_dataset(train, oversample=True)
validate, x_validate, y_validate = scale_dataset(validate, oversample=False)
test, x_test, y_test = scale_dataset(test, oversample=False)
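
One caveat with scaling each split in a separate call: every call fits its own `StandardScaler`, so the validation and test features are standardized with their own means and standard deviations instead of those of the training set. A common alternative, sketched here on synthetic data, fits the scaler once on the training features and reuses it. The array names `x_tr`, `x_te` and their scaled versions are hypothetical, and this is not the approach used to produce the results in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrices standing in for the train and test splits
rng = np.random.default_rng(0)
x_tr = rng.normal(loc=10.0, scale=5.0, size=(300, 3))
x_te = rng.normal(loc=10.0, scale=5.0, size=(100, 3))

# Fit on the training features only, then apply the same transform everywhere
shared_scaler = StandardScaler().fit(x_tr)
x_tr_scaled = shared_scaler.transform(x_tr)
x_te_scaled = shared_scaler.transform(x_te)

# The test split is scaled with the *training* mean and standard deviation,
# so its own mean is close to, but not exactly, zero
print(x_tr_scaled.mean(axis=0).round(6))
print(x_te_scaled.mean(axis=0).round(3))
```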



In [72]:
print(len(x_train))
print(len(y_train))   

# After oversampling, both classes are equally represented, so x_train and y_train each hold 2 * 7407 = 14814 samples

14814
14814
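
To make the doubling from 7407 to 14814 concrete, here is a NumPy-only sketch of random oversampling: minority-class rows are redrawn with replacement until every class matches the majority count. It illustrates the idea behind `RandomOverSampler` but is not imblearn's implementation, and the helper name `random_oversample` is made up for this example.

```python
import numpy as np

def random_oversample(x, y, seed=0):
    """Duplicate randomly chosen minority rows until all classes have equal counts."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()                        # size of the majority class
    selected = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)           # row indices belonging to this class
        extra = rng.choice(idx, size=target - count, replace=True)
        selected.append(np.concatenate([idx, extra]))
    keep = np.concatenate(selected)
    return x[keep], y[keep]

# Toy imbalanced data: 7 samples of class 1 and 3 samples of class 0
x_small = np.arange(20, dtype=float).reshape(10, 2)
y_small = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])

x_bal, y_bal = random_oversample(x_small, y_small)
print(np.unique(y_bal, return_counts=True))
```

With 7 majority samples the balanced set has 2 × 7 = 14 rows, mirroring 2 × 7407 = 14814 in the output above.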


# K-Nearest Neighbors (KNN) Algorithm

## Overview

The **K-Nearest Neighbors (KNN)** algorithm is a simple, intuitive, and versatile machine learning algorithm used for both classification and regression tasks. It is a **non-parametric**, **lazy learning** algorithm that makes predictions based on the proximity of data points.

### Key Features:
- **Lazy Learning:** KNN doesn't learn an explicit model. Instead, it memorizes the training dataset and makes predictions at runtime by examining the nearest data points.
- **Non-Parametric:** KNN doesn't assume any underlying data distribution.
- **Instance-Based Learning:** KNN makes decisions based on the instances in the training data without creating a generalized model.

## How KNN Works

### Classification:
1. **Choose the number of neighbors (k):**
   - `k` is a hyperparameter that determines how many nearest neighbors to consider when making a prediction.
   - If `k = 3`, for example, KNN will look at the 3 nearest neighbors and choose the most frequent class label among them.

2. **Compute Distance:**
   - The distance between a new data point (query point) and each point in the training dataset is computed. Common distance metrics include:
     - **Euclidean Distance:** Most widely used, calculated as:
       \[
       \text{Distance} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
       \]
     - **Manhattan Distance:** Alternative, computed as the sum of absolute differences:
       \[
       \text{Distance} = \sum_{i=1}^{n} |x_i - y_i|
       \]
     - **Minkowski Distance:** A generalization of both Euclidean and Manhattan distances.

3. **Sort the Data:**
   - After calculating distances, the algorithm sorts the training dataset by the distance from the query point.

4. **Vote for Classification:**
   - The `k` nearest neighbors' class labels are examined, and the most frequent label is returned as the prediction.
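
The four classification steps above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration with Euclidean distance and a simple majority vote, not how `KNeighborsClassifier` is implemented internally. The function name `knn_predict` and the toy points are assumptions.

```python
import numpy as np

def knn_predict(x_train, y_train, query, k=3):
    # Step 2: Euclidean distance from the query to every training point
    distances = np.sqrt(((x_train - query) ** 2).sum(axis=1))
    # Step 3: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the neighbors' labels
    labels, votes = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(votes)]

# Two well-separated toy clusters
x_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(x_demo, y_demo, np.array([0.15, 0.15]), k=3))
print(knn_predict(x_demo, y_demo, np.array([1.0, 1.05]), k=3))
```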

### Regression:
- In regression tasks, KNN predicts the value of the query point by averaging the values of the `k` nearest neighbors.
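
For regression the vote is replaced by an average. A short sketch, again with a hypothetical helper (`knn_regress`) and made-up data:

```python
import numpy as np

def knn_regress(x_train, y_train, query, k=3):
    # Distance from the query to every training point
    distances = np.linalg.norm(x_train - query, axis=1)
    # Average the target values of the k closest points
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

x_reg = np.array([[1.0], [2.0], [3.0], [10.0]])
y_reg = np.array([1.0, 2.0, 3.0, 10.0])

# The three nearest neighbors of 2.1 are the points at 2, 3 and 1
estimate = knn_regress(x_reg, y_reg, np.array([2.1]), k=3)
print(estimate)
```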

## The Math Behind KNN

### Distance Metrics:
KNN uses various distance metrics to determine the proximity of data points:

1. **Euclidean Distance:**
   The most common measure of distance used in KNN.
   \[
   \text{Euclidean Distance} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
   \]
   where `x_i` and `y_i` are the coordinates of the data points in the n-dimensional space.

2. **Manhattan Distance:**
   The sum of absolute differences between corresponding feature values.
   \[
   \text{Manhattan Distance} = \sum_{i=1}^{n} |x_i - y_i|
   \]

3. **Minkowski Distance:**
   A generalization of both Euclidean and Manhattan distances. For `p = 2`, it is equivalent to Euclidean distance, and for `p = 1`, it is Manhattan distance.
   \[
   \text{Minkowski Distance} = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
   \]
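
The relationship between the three metrics can be checked numerically. The helper `minkowski` below is a direct transcription of the formula above, written for illustration only:

```python
import numpy as np

def minkowski(a, b, p):
    # (sum_i |a_i - b_i|^p)^(1/p)
    return (np.abs(a - b) ** p).sum() ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = minkowski(a, b, p=1)   # |3| + |4|
euclidean = minkowski(a, b, p=2)   # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```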

### Choosing k:
- A small value of `k` can lead to a noisy model, while a large value can make the model too smooth, ignoring smaller patterns.
- It's common to test various `k` values using cross-validation to determine the best `k`.
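
A minimal sketch of that selection procedure, assuming 5-fold cross-validation on a synthetic dataset from `make_classification`. The candidate values of `k` and the data are placeholders rather than settings tuned for the telescope dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem standing in for real data
x_syn, y_syn = make_classification(n_samples=400, n_features=10, random_state=0)

candidate_ks = [1, 3, 5, 11, 21]
mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds
    mean_scores[k] = cross_val_score(model, x_syn, y_syn, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best k:", best_k)
```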

## Why Use KNN?

### Advantages:
- **Simple and Intuitive:** The algorithm is easy to understand and implement.
- **No Explicit Training Phase:** Since KNN is a lazy learner, fitting amounts to storing the training data, so it is fast. The cost is deferred to prediction time.
- **Works Well with Small to Medium-Sized Datasets:** KNN works efficiently when the dataset is not too large, as the computational complexity increases with large datasets.

### Disadvantages:
- **Computationally Expensive:** For large datasets, calculating distances for every prediction can be slow.
- **Memory-Intensive:** The algorithm needs to store the entire dataset in memory.
- **Sensitive to Irrelevant Features:** KNN's performance can degrade if the data contains irrelevant or noisy features.
- **Choosing k Can Be Challenging:** The optimal value of `k` may vary for different datasets and tasks.

## Types of Problems KNN Can Solve

### 1. **Classification Problems:**
   - **Spam Detection:** Classifying emails as spam or not spam based on text features.
   - **Image Classification:** Identifying objects in images by comparing pixel values to labeled data points.
   - **Customer Segmentation:** Classifying customers based on purchasing behavior or demographics.

### 2. **Regression Problems:**
   - **Predicting House Prices:** Estimating house prices based on features like size, location, and number of rooms.
   - **Stock Price Prediction:** Predicting stock prices based on historical data.

### 3. **Anomaly Detection:**
   - Identifying rare data points that don't follow the usual patterns (e.g., fraud detection).

### 4. **Recommendation Systems:**
   - **Collaborative Filtering:** Recommending items based on the preferences of similar users.

## When to Use KNN

- **When the dataset is small to medium-sized** and can fit into memory.
- **When the decision boundary is highly non-linear**, as KNN can model complex decision surfaces.
- **When the problem is based on proximity or similarity** between data points.

## When Not to Use KNN

- **Large datasets:** KNN requires a lot of memory and computational resources for large datasets because it calculates the distance for every data point at runtime.
- **High-dimensional data (curse of dimensionality):** The performance of KNN can degrade significantly when dealing with high-dimensional data (i.e., datasets with many features).
- **When the data is very sparse (few neighbors)** or when there is significant noise in the data.

## Conclusion

K-Nearest Neighbors is a simple yet powerful algorithm that is widely used for both classification and regression tasks. Despite its simplicity, its performance heavily depends on choosing the right `k` and distance metric, and its computation cost can be a concern for large datasets. KNN is most effective when the dataset is small and the relationships between data points are based on proximity.

---

Feel free to experiment with different distance metrics and values of `k` to tune the KNN algorithm for your specific use case.


In [84]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

In [93]:
Knn_model = KNeighborsClassifier(n_neighbors=3)
Knn_model.fit(x_train , y_train)

In [97]:
y_predict = Knn_model.predict(x_test)

In [98]:
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

           0       0.72      0.73      0.72      1316
           1       0.85      0.85      0.85      2465

    accuracy                           0.81      3781
   macro avg       0.79      0.79      0.79      3781
weighted avg       0.81      0.81      0.81      3781
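
For reference, the precision, recall, and F1 columns in a report like this are derived from counts of true/false positives and negatives. The sketch below recomputes them by hand for a small made-up pair of label arrays (`y_true_demo`, `y_pred_demo`), not for the model output above.

```python
import numpy as np

# Hypothetical true labels and predictions for ten samples
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred_demo = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1])

tp = int(((y_pred_demo == 1) & (y_true_demo == 1)).sum())  # predicted 1, actually 1
fp = int(((y_pred_demo == 1) & (y_true_demo == 0)).sum())  # predicted 1, actually 0
fn = int(((y_pred_demo == 0) & (y_true_demo == 1)).sum())  # predicted 0, actually 1

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```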

