
# K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) is a simple, instance-based learning algorithm used for classification and regression. It predicts the target of a data point from its k nearest neighbors in the feature space: the majority class among them for classification, or the average of their target values for regression.

The key steps are:

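1. Choosing the number of neighbors (k).
2. Calculating the distance between the query point and every training point.
3. Sorting the training points by distance and keeping the k closest.
4. Voting for the majority class (classification) or averaging the target values (regression) among those k neighbors.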
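The four steps above can be sketched in plain Python. This is a minimal illustration rather than how scikit-learn implements KNN: the function names, the choice of Euclidean distance, and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(train_X, train_y, query, k):
    """Return the targets of the k training points closest to `query`."""
    # Step 2: distance from the query to every training point.
    distances = [(euclidean(x, query), y) for x, y in zip(train_X, train_y)]
    # Step 3: sort by distance and keep the k closest.
    distances.sort(key=lambda pair: pair[0])
    return [y for _, y in distances[:k]]


def knn_classify(train_X, train_y, query, k=3):
    """Step 4 (classification): majority vote among the k nearest targets."""
    votes = Counter(k_nearest(train_X, train_y, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k=3):
    """Step 4 (regression): average of the k nearest targets."""
    neighbors = k_nearest(train_X, train_y, query, k)
    return sum(neighbors) / len(neighbors)


# Hypothetical toy data: two features per point (not from the Titanic dataset).
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
labels = [0, 0, 0, 1, 1, 1]                      # class labels
values = [10.0, 12.0, 11.0, 50.0, 55.0, 60.0]    # numeric targets

# Step 1 is the choice of k, passed here as an argument.
print(knn_classify(points, labels, (1.2, 1.4), k=3))
print(knn_regress(points, values, (6.4, 7.0), k=3))
```

Because the distance is computed on raw feature values, a feature with a large range would dominate the result, which is why the notebook scales the numerical columns before fitting.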

**Dataset:** Titanic Dataset


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the Titanic dataset from GitHub
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
data = pd.read_csv(url)

# Display the first few rows of the dataset
print(data.head())

# Drop columns that are not useful for prediction
data = data.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])

# Define features and target
X = data.drop(columns=['Survived'])
y = data['Survived']

# Preprocess the data
# Identify categorical and numerical columns
categorical_cols = ['Sex', 'Embarked']
numerical_cols = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

# Create transformers for preprocessing
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Create a KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)

# Create a pipeline that preprocesses the data and then fits the KNN model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', knn)
])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print('Classification Report:')
print(report)


   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
Ac

**Interpretation**

*Accuracy:* The model's accuracy is approximately 80.44%, meaning it correctly predicts whether a passenger survived for roughly four out of five passengers in the test set.

**Classification Report:**

*Precision:* The precision for class 0 (not survived) is 0.81, meaning that 81% of the passengers predicted not to survive did not survive. For class 1 (survived), the precision is 0.79, meaning 79% of the passengers predicted to survive actually survived.


*Recall:* The recall for class 0 is 0.87, meaning that 87% of the actual not-survived passengers were correctly identified. For class 1, the recall is 0.72, meaning 72% of the actual survived passengers were correctly identified.


*F1-Score:* The F1-score is the harmonic mean of precision and recall. For class 0, it is 0.84, and for class 1, it is 0.75.


*Support:* The support shows the number of actual occurrences of each class in the test set (105 for class 0 and 74 for class 1).
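To make the four report columns concrete, the sketch below computes them by hand for a small set of labels. The labels are hypothetical and chosen only for illustration, so the numbers do not correspond to the Titanic results; `class_metrics` is a helper written for this example, not a scikit-learn function.

```python
def class_metrics(y_true, y_pred, cls):
    """Precision, recall, F1 and support for one class, from raw labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)  # correctly predicted as cls
    fp = sum(1 for t, p in pairs if t != cls and p == cls)  # wrongly predicted as cls
    fn = sum(1 for t, p in pairs if t == cls and p != cls)  # cls members that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}


# Hypothetical labels: 6 actual members of class 0 and 4 of class 1.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

for cls in (0, 1):
    print(cls, class_metrics(y_true, y_pred, cls))
```

Precision divides by the number of predictions made for a class, while recall divides by the number of actual members of that class, which is why the two can differ for the same class.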


## Conclusion

The KNN model performed reasonably well on the Titanic dataset, with an overall accuracy of approximately 80.44%. The model is better at identifying passengers who did not survive (class 0) than those who survived (class 1), as shown by the higher precision, recall, and F1-score for class 0. Further improvements could involve tuning hyperparameters such as the number of neighbors, exploring more sophisticated preprocessing steps, or trying different machine learning algorithms.
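As one possible way to tune the hyperparameters, the sketch below runs a cross-validated grid search over the number of neighbors and the voting weights. It uses synthetic data from `make_classification` as a stand-in (an assumption, so that the example runs without downloading the Titanic CSV), and the grid values are arbitrary choices for illustration. The step name `classifier` mirrors the pipeline above, so the same `classifier__n_neighbors` key should apply to it.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 samples, 6 numeric features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline so it is re-fit on each CV training fold.
demo_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', KNeighborsClassifier()),
])

# Candidate values are illustrative, not recommendations.
param_grid = {
    'classifier__n_neighbors': [3, 5, 7, 9, 11],
    'classifier__weights': ['uniform', 'distance'],
}

search = GridSearchCV(demo_pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(Xd_train, yd_train)

print('Best parameters:', search.best_params_)
print('Cross-validated accuracy:', search.best_score_)
print('Held-out accuracy:', search.score(Xd_test, yd_test))
```

Selecting k by cross-validation on the training split keeps the test set untouched until the final evaluation.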

**NOTE:**

*SimpleImputer:*

This transformer handles missing values. For numerical columns, the mean strategy replaces each missing value with the mean of the column. For categorical columns, the most_frequent strategy replaces each missing value with the most frequent value in the column.
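A small sketch of both strategies on hypothetical values (the column names echo the Titanic data, but the rows are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical columns, each with one missing entry.
ages = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 36.0]})
ports = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'S']}, dtype=object)

# Numerical column: the gap is filled with the mean of the observed values.
age_imputer = SimpleImputer(strategy='mean')
ages_filled = age_imputer.fit_transform(ages)

# Categorical column: the gap is filled with the most frequent category.
port_imputer = SimpleImputer(strategy='most_frequent')
ports_filled = port_imputer.fit_transform(ports)

print(ages_filled.ravel())   # missing age becomes (22 + 26 + 36) / 3 = 28.0
print(ports_filled.ravel())  # missing port becomes 'S'
```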

*Pipeline for Preprocessing:*

1. numerical_transformer: This pipeline handles numerical columns by first imputing missing values with the mean and then scaling the values.
2. categorical_transformer: This pipeline handles categorical columns by first imputing missing values with the most frequent value and then applying one-hot encoding.
3. ColumnTransformer: This transformer combines the numerical and categorical transformers, applying them to the respective columns.

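Together, these steps ensure that missing values are filled in, numerical features are scaled, and categorical features are encoded before the data reaches the KNN classifier.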
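To see what the combined preprocessing produces, the sketch below applies the same kind of transformers to a four-row frame. The rows are hypothetical and only two numerical and two categorical columns are included to keep the output small.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows with gaps in both numerical and categorical columns.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Fare': [7.25, 71.28, np.nan, 53.10],
    'Sex': pd.Series(['male', 'female', 'female', 'male'], dtype=object),
    'Embarked': pd.Series(['S', 'C', np.nan, 'S'], dtype=object),
})

toy_numerical = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
toy_categorical = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
toy_preprocessor = ColumnTransformer(transformers=[
    ('num', toy_numerical, ['Age', 'Fare']),
    ('cat', toy_categorical, ['Sex', 'Embarked']),
])

transformed = toy_preprocessor.fit_transform(toy)
# The result may be a sparse matrix depending on its density; densify for inspection.
if hasattr(transformed, 'toarray'):
    transformed = transformed.toarray()

print(transformed.shape)            # 2 scaled numeric + 4 one-hot columns
print(np.isnan(transformed).any())  # no missing values remain
```

Each numeric column ends up with zero mean, and each categorical column is expanded into one indicator column per category seen during fitting.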