## KNN Algorithm
K-Nearest Neighbors (KNN) is a machine learning algorithm used for classification and regression tasks. It operates on the idea that similar data points tend to belong to the same category or have similar values. KNN is a non-parametric and lazy learner algorithm, which means it doesn't make assumptions about the underlying data distribution and postpones the actual learning until a prediction is needed.

**How KNN Works:**
1. **Select the Number of Neighbors (K):** You start by choosing the number of nearest neighbors, denoted as "K." This value determines how many neighboring data points will be considered when making a prediction.

2. **Calculate Euclidean Distance:** The algorithm calculates the Euclidean distance between the new data point and every point in the training set. Euclidean distance is the straight-line distance between two points in space: the square root of the sum of squared differences across all features.

3. **Select K Nearest Neighbors:** The KNN algorithm then selects the K data points with the shortest Euclidean distances to the new data point. Among the K nearest neighbors, the algorithm counts how many data points belong to each category or class.

4. **Assign the New Data Point:** The algorithm assigns the category of the new data point based on which category has the most neighbors among the K nearest neighbors.

5. **Model is Ready:** Your KNN model is now ready to make predictions for new data points.
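
The steps above can be traced on a tiny example. The sketch below uses made-up points and labels (`points`, `labels`, and `new_point` are illustrative names, not part of the notebook) and computes the distance as $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$.

```python
import numpy as np
from collections import Counter

# Toy training set (hypothetical values): two features, two classes
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                   [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
labels = np.array(["A", "A", "A", "B", "B", "B"])
new_point = np.array([2.0, 2.0])

k = 3  # Step 1: choose K

# Step 2: Euclidean distance from the new point to every training point
distances = np.sqrt(((points - new_point) ** 2).sum(axis=1))

# Step 3: indices of the K smallest distances, then count classes among them
nearest = np.argsort(distances)[:k]
votes = Counter(labels[nearest].tolist())

# Step 4: assign the majority class
predicted = votes.most_common(1)[0][0]
print(predicted)  # prints "A": the three closest points all belong to class A
```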

**Choosing the Value of K:**
- There is no strict rule for selecting the best K value, so you often need to experiment with different values.
- A commonly used value for K is 5, but it can vary depending on your dataset.
- Very low K values (e.g., 1 or 2) can be sensitive to outliers and lead to noisy predictions.
- Large K values can provide smoother predictions but might not capture local patterns well.
- For two-class problems, an odd K is usually preferred because it rules out tied votes.
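
One common way to experiment with K is to score several candidates on held-out data and keep the best one. The following is a minimal sketch on synthetic data, assuming integer class labels and a simple 75/25 hold-out split; the helper `knn_vote` and the candidate list are illustrative choices, not a recommendation from the source.

```python
import numpy as np

def knn_vote(X_fit, y_fit, X_query, k):
    """Majority-vote KNN with Euclidean distance (integer class labels)."""
    # Pairwise distances, shape (n_query, n_fit)
    dists = np.linalg.norm(X_query[:, None, :] - X_fit[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    # bincount(...).argmax() returns the most frequent label (lowest label on ties)
    return np.array([np.bincount(y_fit[row]).argmax() for row in nearest])

# Synthetic two-class data (assumption: two overlapping Gaussian blobs)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Shuffle, then hold out the last 50 rows for validation
order = rng.permutation(len(X))
X, y = X[order], y[order]
X_fit, y_fit, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

candidate_ks = [1, 3, 5, 7, 9, 11]
scores = {k: float(np.mean(knn_vote(X_fit, y_fit, X_val, k) == y_val))
          for k in candidate_ks}
best_k = max(scores, key=scores.get)
print(scores, "best K:", best_k)
```

A single hold-out split gives a noisy estimate; cross-validation is a common refinement when the dataset is small.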

**Advantages of KNN:**
- Simple to implement and understand.
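- Can be robust to noisy training data when K is large enough to average out outliers.
- Can be effective when plenty of training data is available, although prediction cost grows with the size of the training set.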

**Disadvantages of KNN:**
- Requires selecting the appropriate K value.
- Computationally expensive for large datasets due to distance calculations for all training samples.

**Applications of KNN:**
- Data preprocessing: KNN can be used to impute missing values in datasets.
- Recommendation systems: It's used to suggest items or content based on user behavior.
- Finance: Assessing credit risk by predicting loan solvency.
- Healthcare: Predicting health-related outcomes like disease risk.
- Pattern recognition: Identifying patterns in text, images, and more.

**Strengths and Weaknesses:**
- Strengths: Easy to implement, adaptable to changing data, and requires few hyperparameters.
- Weaknesses: Inefficient for large datasets, sensitive to the curse of dimensionality (works less well in high-dimensional spaces), and prone to overfitting or underfitting depending on the choice of K.
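
The curse of dimensionality can be illustrated numerically: as the number of features grows, the nearest and farthest points end up at nearly the same distance, so "nearest" carries less information. The sketch below assumes uniformly random points and uses a hypothetical helper, `relative_contrast`, defined only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

contrast = {dims: relative_contrast(dims) for dims in (2, 10, 100, 1000)}
for dims, value in contrast.items():
    print(f"{dims:>4} dimensions: relative contrast = {value:.2f}")
```

The contrast shrinks sharply between 2 and 1000 dimensions, which is one reason feature selection or dimensionality reduction is often applied before KNN.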

In [1]:
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer

In [2]:
df = pd.read_csv('Indicadores_municipales_sabana_DA.csv', index_col=0, encoding='latin-1')
df

ent,nom_ent,mun,clave_mun,nom_mun,pobtot_ajustada,pobreza,pobreza_e,pobreza_m,vul_car,vul_ing,...,pobreza_alim_10,pobreza_cap_90,pobreza_cap_00,pobreza_cap_10,pobreza_patrim_90,pobreza_patrim_00,pobreza_patrim_10,gini_90,gini_00,gini_10
1,Aguascalientes,1,1001,Aguascalientes,794304,30.531104,2.264478,28.266627,27.983320,8.419106,...,11.805700,20.4,12.7,18.474600,43.4,33.7,41.900398,0.473,0.425,0.422628
1,Aguascalientes,2,1002,Asientos,48592,67.111172,8.040704,59.070468,22.439389,5.557604,...,21.993299,39.9,29.0,30.980801,64.2,48.9,59.175800,0.379,0.533,0.343879
1,Aguascalientes,3,1003,Calvillo,53104,61.360527,7.241238,54.119289,29.428583,2.921336,...,19.266800,39.5,33.1,28.259199,63.9,57.9,56.504902,0.414,0.465,0.386781
1,Aguascalientes,4,1004,Cosío,14101,52.800458,4.769001,48.031458,27.128568,7.709276,...,14.303200,35.2,21.0,22.386101,59.7,40.1,51.164501,0.392,0.541,0.344984
1,Aguascalientes,5,1005,Jesús María,101379,45.338512,6.084037,39.254475,26.262912,8.279864,...,15.085100,36.6,22.6,22.139999,60.6,42.2,45.703899,0.391,0.469,0.458083
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32,Zacatecas,54,32054,Villa Hidalgo,21016,74.848837,12.301183,62.547654,19.229856,3.177689,...,30.055300,51.8,54.8,41.368999,73.5,70.9,70.859596,0.403,0.589,0.342037
32,Zacatecas,55,32055,Villanueva,27385,65.450191,10.203506,55.246687,23.623556,5.007426,...,13.138800,34.2,25.9,20.563601,57.8,44.1,46.659199,0.422,0.463,0.362527
32,Zacatecas,56,32056,Zacatecas,117528,29.541959,3.535624,26.006335,16.644262,8.828019,...,7.164800,15.7,20.7,12.115300,36.6,41.8,32.302700,0.528,0.498,0.436339
32,Zacatecas,57,32057,Trancoso,20456,78.374962,14.607016,63.767946,13.750759,4.440331,...,21.285900,36.2,36.4,30.037100,60.5,54.7,57.394501,0.380,0.483,0.365307


In [3]:
columns_to_remove = ['nom_ent', 'nom_mun', 'gdo_rezsoc00', 'gdo_rezsoc05', 'gdo_rezsoc10']
df.drop(columns=columns_to_remove, inplace=True)

In [4]:
# Fill missing values with each column's median
imputer = SimpleImputer(strategy='median')

# Fit the imputer on the data and transform it
dataset = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
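
As a rough illustration of what median imputation does column by column, here is a NumPy-only sketch on a small made-up array (`toy` and its values are hypothetical; the notebook itself relies on `SimpleImputer`).

```python
import numpy as np

# Toy data with gaps (hypothetical values)
toy = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan],
                [5.0, 40.0]])

# Median of each column, ignoring NaN: [3.0, 20.0]
col_medians = np.nanmedian(toy, axis=0)

# Replace each NaN with the median of its own column
filled = np.where(np.isnan(toy), col_medians, toy)
print(filled)
```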


In [5]:
# Find the minimum and maximum values of the 'ic_rezedu' column
min_percentage = dataset['ic_rezedu'].min()
max_percentage = dataset['ic_rezedu'].max()

# Cutoff between the 'low' and 'high' categories: the midpoint of the observed range
percentage_cutoff = min_percentage + (max_percentage - min_percentage) / 2

# Create categories based on the percentage of educational lag ('ic_rezedu').
# include_lowest=True keeps the row holding the minimum value in the 'low' bin;
# without it, pd.cut leaves that row as NaN.
dataset['access_level_cat'] = pd.cut(
    dataset['ic_rezedu'],
    bins=[min_percentage, percentage_cutoff, max_percentage],
    labels=['low', 'high'],
    include_lowest=True
)
# Map the category labels to numerical values
label_mapping = {'low': 2, 'high': 1}
dataset['access_level'] = dataset['access_level_cat'].map(label_mapping)
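
The midpoint binning can be checked on a handful of numbers. This NumPy-only sketch uses made-up percentages and treats values at or below the midpoint as 'low'. One detail worth knowing when comparing it with `pd.cut`: by default `pd.cut` leaves the left edge of the first bin open, so the minimum value is only binned if `include_lowest=True` is passed.

```python
import numpy as np

toy_values = np.array([5.0, 12.0, 20.0, 33.0, 45.0])  # hypothetical percentages

# Midpoint of the observed range: 5 + (45 - 5) / 2 = 25.0
toy_cutoff = toy_values.min() + (toy_values.max() - toy_values.min()) / 2

# Values at or below the midpoint are 'low', the rest are 'high'
toy_category = np.where(toy_values <= toy_cutoff, "low", "high")

# Same numeric coding as the notebook's label_mapping: 'low' -> 2, 'high' -> 1
toy_codes = np.where(toy_values <= toy_cutoff, 2, 1)
print(toy_cutoff, toy_category.tolist(), toy_codes.tolist())
```

Because the cutoff sits at the midpoint of the range rather than at the median, the two classes can end up quite unbalanced when the column is skewed.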

In [6]:
columns_to_remove = ['access_level_cat']
dataset.drop(columns=columns_to_remove, inplace=True)

In [7]:
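# ~80% for training: rows 0-1964, feature columns 0-131
X_train = dataset.iloc[:1965, 0:132].values
y_train = dataset.iloc[:1965, 133].values

# ~20% for testing: rows 1966 onward
# (note: row 1965 falls in neither split, and column 132 is not used as a feature)
X_test = dataset.iloc[1966:, 0:132].values
y_test = dataset.iloc[1966:, 133].values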
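
The preview of the data suggests rows are listed in state order (`ent` 1 through 32), so a positional split places only the last states in the test set. A shuffled split is one alternative. The sketch below uses made-up arrays and an arbitrary seed; `demo_features` and `demo_target` are stand-ins, not the notebook's data. scikit-learn's `train_test_split`, already imported above, shuffles by default and could serve the same purpose.

```python
import numpy as np

rng = np.random.default_rng(42)          # arbitrary fixed seed for reproducibility
n_rows = 1000                            # hypothetical row count
demo_features = rng.random((n_rows, 4))  # stand-in feature matrix
demo_target = rng.integers(1, 3, n_rows) # stand-in labels (1 or 2)

# Shuffle row indices, then cut at 80%
shuffled = rng.permutation(n_rows)
n_train = int(0.8 * n_rows)
train_idx, test_idx = shuffled[:n_train], shuffled[n_train:]

demo_X_train, demo_y_train = demo_features[train_idx], demo_target[train_idx]
demo_X_test, demo_y_test = demo_features[test_idx], demo_target[test_idx]
print(demo_X_train.shape, demo_X_test.shape)
```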

In [8]:
# Create an instance of the standardization scaler
scaler = StandardScaler()

# Fit the scaler on the training data only, then apply the same transformation to the test data
X_train_standardized = scaler.fit_transform(X_train)
X_test_standardized = scaler.transform(X_test)
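
Standardization matters for KNN because features on large scales would otherwise dominate the distance. The sketch below reproduces the idea by hand on made-up numbers: the mean and standard deviation come from the training rows only and are then reused for the test rows. The `toy_` names are illustrative.

```python
import numpy as np

toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
toy_test = np.array([[2.0, 400.0]])

# Statistics are estimated on the training data only
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

toy_train_std = (toy_train - train_mean) / train_std
toy_test_std = (toy_test - train_mean) / train_std  # reuse training statistics
print(toy_train_std)
print(toy_test_std)
```

Fitting on the training rows alone keeps information about the test set out of the preprocessing step.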

In [9]:
# Define the KNN function
def knn_predict(X_train, y_train, X_test, k=5):
    y_pred = []

    for test_point in X_test:
        # Calculate distances from the test point to all training points
        distances = [np.linalg.norm(test_point - train_point) for train_point in X_train]

        # Get indices of the k-nearest neighbors
        nearest_indices = np.argsort(distances)[:k]

        # Get the labels of the k-nearest neighbors
        nearest_labels = [y_train[i] for i in nearest_indices]

        # Predict the class by majority voting
        prediction = max(set(nearest_labels), key=nearest_labels.count)
        y_pred.append(prediction)

    return y_pred

# Define the accuracy function
def accuracy(y_true, y_pred):
    correct = sum(1 for a, b in zip(y_true, y_pred) if a == b)
    return correct / len(y_true)

# Use the knn_predict function to make predictions
y_pred = knn_predict(X_train_standardized, y_train, X_test_standardized, k=5)

# Calculate and print the accuracy
model_accuracy = accuracy(y_test, y_pred)
print("Accuracy:", model_accuracy)

Accuracy: 0.8938775510204081
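
The first cell imports `KNeighborsClassifier` and `accuracy_score` but the notebook never calls them. One plausible use is as a cross-check of the hand-written function. The sketch below runs on synthetic stand-in data (two Gaussian blobs labelled 1 and 2), so its accuracy says nothing about the municipal dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (assumption): two well-separated Gaussian blobs
rng = np.random.default_rng(0)
demo_X = np.vstack([rng.normal(0.0, 1.0, size=(50, 3)),
                    rng.normal(4.0, 1.0, size=(50, 3))])
demo_y = np.array([1] * 50 + [2] * 50)

# Shuffle, then keep the last 20 rows for evaluation
order = rng.permutation(len(demo_X))
demo_X, demo_y = demo_X[order], demo_y[order]
fit_X, fit_y = demo_X[:80], demo_y[:80]
eval_X, eval_y = demo_X[80:], demo_y[80:]

clf = KNeighborsClassifier(n_neighbors=5)  # default metric is Euclidean (Minkowski, p=2)
clf.fit(fit_X, fit_y)
demo_accuracy = accuracy_score(eval_y, clf.predict(eval_X))
print("Accuracy:", demo_accuracy)
```

The same calls could be tried with `X_train_standardized`, `y_train`, `X_test_standardized`, and `y_test`, provided the labels contain no missing values. Small differences from the hand-written function are possible when votes or distances are tied.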
