# Imports

I am importing the necessary libraries. 
`pandas` will help me handle data efficiently, 
while the `sklearn` libraries will allow me to preprocess the data, 
split it into training and testing sets, and build models.

In [25]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

I am loading the dataset into a pandas DataFrame using `pd.read_csv`.
I am also setting the "id" column as the index because it **uniquely** identifies each row.

In [26]:
df = pd.read_csv("./Autism.csv", index_col="id")

Now, I am displaying the first five rows of the dataset to understand its structure.

To see more than the first five rows, pass a number to `head()`:

`df.head(n)`

Here `n` is the number of rows to show.
For example, `df.head(10)` shows the first ten rows. The default is 5.

In [27]:
df.head()

| id | A1_Score | A2_Score | A3_Score | A4_Score | A5_Score | A6_Score | A7_Score | A8_Score | A9_Score | A10_Score | age | gender | ethnicity | jaundice | country_of_res | used_app_before | result | age_desc | relation | Class/ASD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 26.0 | f | White-European | no | United States | no | 6 | 18 and more | Self | NO |
| 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 24.0 | m | Latino | no | Brazil | no | 5 | 18 and more | Self | NO |
| 2 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 27.0 | m | Latino | yes | Spain | no | 8 | 18 and more | Parent | YES |
| 3 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 35.0 | f | White-European | no | United States | no | 6 | 18 and more | Self | NO |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 40.0 | f | NaN | no | Egypt | no | 2 | 18 and more | NaN | NO |


Similarly, `df.tail()` shows the last 5 rows, and `df.tail(n)` shows the last n rows.
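As a quick illustration of `head()` and `tail()`, here is a minimal sketch on a made-up ten-row frame. The name `toy` and its `value` column are only assumptions for this example and are not part of the autism dataset.

```python
import pandas as pd

# A made-up frame with ten rows, used only for illustration.
toy = pd.DataFrame({"value": range(10)})

print(toy.head(3))       # first three rows
print(toy.tail(2))       # last two rows
print(len(toy.head()))   # with no argument, head() returns five rows
```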

# Data Preprocessing

I am checking for missing values in the dataset using `df.isna().sum()`.
`isna()` marks every null entry as `True`, and `sum()` then adds these up column by column.
The result is the number of null entries in each column.

In [28]:
df.isna().sum()

A1_Score            0
A2_Score            0
A3_Score            0
A4_Score            0
A5_Score            0
A6_Score            0
A7_Score            0
A8_Score            0
A9_Score            0
A10_Score           0
age                 2
gender              0
ethnicity          95
jaundice            0
country_of_res      0
used_app_before     0
result              0
age_desc            0
relation           95
Class/ASD           0
dtype: int64

Since there are missing values, I will drop rows with null values using `dropna()`. 
This ensures the data is clean and ready for processing.

In [29]:
df = df.dropna()
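To make the two steps above easier to follow, here is a minimal sketch on a made-up frame. The frame `toy` and its values are assumptions for illustration only. It shows that `isna().sum()` counts nulls per column and that `dropna()` removes every row with at least one null.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing age and one missing ethnicity.
toy = pd.DataFrame({
    "age": [26.0, np.nan, 27.0, 35.0],
    "ethnicity": ["Latino", "Asian", None, "Black"],
    "result": [6, 5, 8, 6],
})

# isna() gives a True/False table; sum() counts the True values per column.
null_counts = toy.isna().sum()
print(null_counts)

# dropna() returns a new frame without the rows that contain any null.
cleaned = toy.dropna()
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Note that a row is dropped even if only one of its columns is missing, which is why the real dataset shrinks to 609 rows.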

I am confirming the number of non-null entries after dropping the missing values.

In [30]:
df.count()

A1_Score           609
A2_Score           609
A3_Score           609
A4_Score           609
A5_Score           609
A6_Score           609
A7_Score           609
A8_Score           609
A9_Score           609
A10_Score          609
age                609
gender             609
ethnicity          609
jaundice           609
country_of_res     609
used_app_before    609
result             609
age_desc           609
relation           609
Class/ASD          609
dtype: int64

Now, I want to identify which columns are of object type.
This will help me later when I apply encoding to categorical data.

Categorical data refers to variables that contain labels or names representing different categories or groups.
Unlike numerical data, which is represented by numbers, categorical data usually consists of words, symbols, or letters.
In our dataset, examples of categorical data include columns like "gender," "ethnicity," or "jaundice."


Most machine learning models, such as Logistic Regression, SVM, KNN, and others, require numerical input.
These models can perform mathematical computations only on numbers and cannot process text-based categories directly.
Therefore, we must convert categorical data into numerical form. This process is called encoding.

In this code, we use Ordinal Encoding for categorical columns like "ethnicity," "gender," and "relation." This method assigns an integer to each category. By default, `OrdinalEncoder` numbers the categories in sorted (alphabetical) order starting from 0, so in the "gender" column, which contains "f" and "m," "f" is encoded as 0 and "m" as 1.

In [50]:
object_cols = df.select_dtypes('object').columns

# I am checking the data types of each column.
# Note: this cell was re-run after the encoding and column drops further below
# (see its number, In [50]), so the output already shows the converted types.
df.dtypes

A1_Score       int64
A2_Score       int64
A3_Score       int64
A4_Score       int64
A5_Score       int64
A6_Score       int64
A7_Score       int64
A8_Score       int64
A9_Score       int64
A10_Score      int64
age          float64
gender       float64
ethnicity    float64
jaundice     float64
result         int64
relation     float64
Class/ASD    float64
dtype: object
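For reference, here is a minimal sketch of how `select_dtypes('object')` picks out the text columns before any encoding. The frame `toy` is made up, and I set `dtype="object"` explicitly on the text columns as an assumption so the example behaves the same across pandas versions.

```python
import pandas as pd

# Made-up frame: two numeric columns and two text columns.
toy = pd.DataFrame({
    "A1_Score": [1, 0, 1],
    "age": [26.0, 24.0, 27.0],
    "gender": pd.Series(["f", "m", "m"], dtype="object"),
    "jaundice": pd.Series(["no", "no", "yes"], dtype="object"),
})

# Keep only the columns whose dtype is object (here, the text columns).
object_cols = toy.select_dtypes("object").columns
print(list(object_cols))
```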

I am looking at the unique values in the "ethnicity" column to understand its categories.

In [32]:
df["ethnicity"].unique()

array(['White-European', 'Latino', 'Others', 'Black', 'Asian',
       'Middle Eastern ', 'Pasifika', 'South Asian', 'Hispanic',
       'Turkish'], dtype=object)

I am defining a helper function `encode()` that will convert categorical values in specific columns into numerical values.
I will use `OrdinalEncoder` from sklearn to perform this task.

**Ordinal Encoder** is a tool that turns each category in a column into an integer.

Here’s how it works:
1. **Default Order**: By default, sklearn's `OrdinalEncoder` sorts the categories (alphabetically for text) and numbers them from 0. For "small," "medium," "large" that gives:
   - "large" becomes 0
   - "medium" becomes 1
   - "small" becomes 2

2. **Keeping a Natural Order**: If the categories have a natural order, you can pass it through the `categories` parameter. With the order "small," "medium," "large" the codes are 0, 1 and 2, so "medium" (1) sits between "small" (0) and "large" (2).

In summary, **Ordinal Encoder** converts categories into numbers, and the numbers only reflect a meaningful order if you supply that order yourself. Columns like "ethnicity" have no natural order, so their codes are just labels.
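Here is a minimal sketch of how `OrdinalEncoder` numbers categories, using a made-up "size" column (the values are assumptions for illustration). It compares the default behaviour, which sorts the categories, with an explicit order passed through the `categories` parameter.

```python
from sklearn.preprocessing import OrdinalEncoder

# Made-up column with one value per row (2D input, as the encoder expects).
sizes = [["small"], ["large"], ["medium"], ["small"]]

# Default: categories are sorted, so large=0, medium=1, small=2.
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(sizes)
print(default_enc.categories_[0])
print(default_codes.ravel())

# Explicit order: small=0, medium=1, large=2.
ordered_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordered_codes = ordered_enc.fit_transform(sizes)
print(ordered_codes.ravel())
```

In this notebook I use the default behaviour, so the codes follow the sorted order of the category names.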

In [38]:
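def encode(cols):
    """Replace each listed column of the global `df` with integer codes."""
    for col in cols:
        # A fresh encoder per column, so every column gets its own 0..n-1 codes.
        encoder = OrdinalEncoder()
        # The encoder expects 2D input, hence the double brackets.
        df[[col]] = encoder.fit_transform(df[[col]])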

I am applying the encode function on several columns that contain categorical values.
These columns include details like ethnicity, gender, and the ASD classification.

In [39]:
encode(["ethnicity", "gender", "jaundice", "country_of_res", "relation", "Class/ASD"])

Checking the dataframe to see whether the function worked or not

In [40]:
df.head()

| id | A1_Score | A2_Score | A3_Score | A4_Score | A5_Score | A6_Score | A7_Score | A8_Score | A9_Score | A10_Score | age | gender | ethnicity | jaundice | country_of_res | used_app_before | result | age_desc | relation | Class/ASD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 26.0 | 0.0 | 9.0 | 0.0 | 57.0 | no | 6 | 18 and more | 4.0 | 0.0 |
| 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 24.0 | 1.0 | 3.0 | 0.0 | 11.0 | no | 5 | 18 and more | 4.0 | 0.0 |
| 2 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 27.0 | 1.0 | 3.0 | 1.0 | 49.0 | no | 8 | 18 and more | 2.0 | 1.0 |
| 3 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 35.0 | 0.0 | 9.0 | 0.0 | 57.0 | no | 6 | 18 and more | 4.0 | 0.0 |
| 5 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 36.0 | 1.0 | 5.0 | 1.0 | 57.0 | no | 9 | 18 and more | 4.0 | 1.0 |


I am now examining the "age_desc" column to see its value distribution.
This will help me decide whether to retain or remove the column.

In [41]:
df.age_desc.value_counts()

age_desc
18 and more    609
Name: count, dtype: int64

Based on the inspection, the "age_desc" column is not useful for my model
because **all of the rows have the same value for this column.**
A constant column gives the model no information to predict with,
so I will drop it using `drop()`.

P.S. - 
- `inplace=True` is important because it makes `drop()` modify `df` directly instead of returning a new DataFrame.

In [42]:
df.drop(columns=["age_desc"], inplace=True)
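The same check can be done for all columns at once. Here is a minimal sketch on a made-up frame (the name `toy` and its values are assumptions for illustration) that finds columns with a single unique value and drops them without `inplace=True`.

```python
import pandas as pd

# Made-up frame where "age_desc" has the same value in every row.
toy = pd.DataFrame({
    "age": [26.0, 24.0, 27.0],
    "age_desc": ["18 and more", "18 and more", "18 and more"],
    "result": [6, 5, 8],
})

# A column with exactly one unique value carries no information.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)

# Without inplace=True, drop() returns a new frame and leaves toy unchanged.
reduced = toy.drop(columns=constant_cols)
print(list(reduced.columns))
```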

I am checking how many unique countries are present in the "country_of_res" column.

In [43]:
df["country_of_res"].nunique()

60

The column has 60 unique countries across only 609 rows, 
so each country has only about ten rows on average and I do not expect it to help the model much. 
I will drop this column as well.

In [44]:
df.drop(columns=["country_of_res"], inplace=True)

I am reviewing the values in the "used_app_before" column to determine its usefulness.

In [45]:
df["used_app_before"].value_counts()

used_app_before
no     599
yes     10
Name: count, dtype: int64

I have decided to drop the "used_app_before" column since it doesn't provide much value.

With 599 "no" and only 10 "yes" values, the column is almost constant, so the model has very few "yes" examples to learn anything from.

In [46]:
df.drop(columns=["used_app_before"], inplace=True)

Checking the dataframe after dropping irrelevant columns.

In [47]:
df.head()

| id | A1_Score | A2_Score | A3_Score | A4_Score | A5_Score | A6_Score | A7_Score | A8_Score | A9_Score | A10_Score | age | gender | ethnicity | jaundice | result | relation | Class/ASD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 26.0 | 0.0 | 9.0 | 0.0 | 6 | 4.0 | 0.0 |
| 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 24.0 | 1.0 | 3.0 | 0.0 | 5 | 4.0 | 0.0 |
| 2 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 27.0 | 1.0 | 3.0 | 1.0 | 8 | 2.0 | 1.0 |
| 3 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 35.0 | 0.0 | 9.0 | 0.0 | 6 | 4.0 | 0.0 |
| 5 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 36.0 | 1.0 | 5.0 | 1.0 | 9 | 4.0 | 1.0 |


Now, I am splitting the data into features (X) and target labels (y). 
The target variable is "Class/ASD", which indicates whether someone is classified as having ASD.

- I am using `train_test_split()` to divide the data into training and testing sets. 
- I will use 80% of the data for training and 20% for testing. 
- The `random_state` parameter makes the split reproducible.

- `df.drop()` without `inplace=True` returns a copy of the dataframe without the specified column (here, the column we want to predict). This gives us all the columns we will use to make predictions.

- `test_size` takes a decimal number between 0 and 1. A value of 0.2 puts 20% of the rows in the test set.

In [19]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns="Class/ASD"), df["Class/ASD"], test_size=0.2, random_state=70)
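To see what the split does, here is a minimal sketch on a made-up ten-row frame (the names `toy`, `feature` and `label` are assumptions for illustration). It shows the 80/20 sizes and that the same `random_state` gives the same split again.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with ten rows.
toy = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

X = toy.drop(columns="label")   # everything except the target
y = toy["label"]                # the target column

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=70)
print(len(X_tr), len(X_te))     # 8 training rows, 2 testing rows

# Repeating the call with the same random_state picks the same rows.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.2, random_state=70)
print(X_te.index.equals(X_te2.index))
```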

Checking the training data after the split.

In [48]:
X_train.head()

| id | A1_Score | A2_Score | A3_Score | A4_Score | A5_Score | A6_Score | A7_Score | A8_Score | A9_Score | A10_Score | age | gender | ethnicity | jaundice | result | relation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 298 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 27.0 | 0.0 | 9.0 | 1.0 | 5 | 4.0 |
| 41 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 28.0 | 0.0 | 0.0 | 0.0 | 2 | 4.0 |
| 243 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 34.0 | 1.0 | 5.0 | 0.0 | 5 | 4.0 |
| 653 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 30.0 | 0.0 | 9.0 | 0.0 | 10 | 4.0 |
| 668 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 38.0 | 1.0 | 9.0 | 0.0 | 4 | 4.0 |


# Models

I am importing several machine learning models from sklearn.
These models include 

- logistic regression
- K-Nearest Neighbors (KNN)
- Support Vector Machines (SVM) 
- Naive Bayes
  
I am also importing the `classification_report` to evaluate model performance.

In [69]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import classification_report 

------

### 1. **K-Nearest Neighbors (KNN)**

Imagine you want to guess what kind of ice cream a person likes. You could look at the few people who are most similar to them. If most of those people like chocolate ice cream, you might guess that this person likes chocolate too.

**How it works:** KNN looks at the closest examples (neighbors) in your data and assigns a label based on what most of those neighbors have. It’s like asking your friends what they like to decide what to give someone!
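The idea can be sketched by hand in a few lines. This is only an illustration, not how sklearn implements it internally: the points, the labels and the helper `knn_predict` are all made up for this example.

```python
from collections import Counter

import numpy as np

# Made-up 2D points: one group near (1, 1), another near (6, 7).
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                   [6.0, 6.0], [7.0, 7.0], [6.5, 8.0]])
labels = np.array(["chocolate", "chocolate", "chocolate",
                   "vanilla", "vanilla", "vanilla"])

def knn_predict(query, k=3):
    """Return the most common label among the k points closest to query."""
    distances = np.linalg.norm(points - query, axis=1)  # straight-line distances
    nearest = np.argsort(distances)[:k]                 # indices of the k closest
    votes = Counter(labels[nearest].tolist())
    return votes.most_common(1)[0][0]

print(knn_predict(np.array([1.2, 1.4])))
print(knn_predict(np.array([6.8, 6.9])))
```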

-----

KNN is a simple algorithm that assigns each point the most common label among its k nearest neighbors (sklearn's default is k = 5). 
I am training the model using the training data (`X_train` and `y_train`), 
then predicting labels for the test set (`X_test`).

In [71]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

y_knn = knn.predict(X_test)

print(classification_report(y_test, y_knn))

              precision    recall  f1-score   support

         0.0       0.94      0.94      0.94        87
         1.0       0.86      0.86      0.86        35

    accuracy                           0.92       122
   macro avg       0.90      0.90      0.90       122
weighted avg       0.92      0.92      0.92       122



------

### 2. **Support Vector Machine (SVM)**
Think of this as drawing a line on a piece of paper to separate two groups of dots (for example, red dots and blue dots). SVM tries to draw this line (or hyperplane) so that the gap between the line and the nearest dots of each group is as wide as possible.

**How it works:** SVM finds the best line or boundary that keeps different groups apart. It’s good at handling lots of features and finding clear boundaries.
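Here is a minimal sketch of the "draw a line" picture, assuming two made-up groups of 2D points. I use `kernel="linear"` here so the boundary really is a straight line; this is a choice for the illustration and differs from the default kernel of `SVC()`.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up dots: group 0 near (1, 1), group 1 near (5, 5).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")
clf.fit(X, y)

# New dots are labeled by which side of the line they fall on.
predictions = clf.predict([[1.5, 1.5], [5.5, 5.5]])
print(predictions)

# The support vectors are the dots closest to the line; they fix its position.
print(clf.support_vectors_)
```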


------

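SVM is a powerful algorithm that tries to find the optimal hyperplane that separates different classes. Note that `SVC()` uses the RBF kernel by default, so the boundary it learns here can be curved rather than a straight line.
I am training the SVM model and evaluating its performance in a similar way to KNN.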

In [79]:
svm = SVC()
svm.fit(X_train, y_train)

y_svm = svm.predict(X_test)

print(classification_report(y_test, y_svm))

              precision    recall  f1-score   support

         0.0       0.95      1.00      0.97        87
         1.0       1.00      0.86      0.92        35

    accuracy                           0.96       122
   macro avg       0.97      0.93      0.95       122
weighted avg       0.96      0.96      0.96       122



----

### 3. **Naive Bayes**
Imagine you’re trying to guess if someone likes sports based on their hobbies and interests. Naive Bayes is like making a guess based on each hobby and interest independently and then combining those guesses.

**How it works:** It uses probabilities to make a prediction. If someone likes to read and play soccer, Naive Bayes calculates the chance of them liking sports based on those hobbies and combines the chances to make a final guess.



-----
**Some Extra Info**

#### **If you are in class 9 or 10 and do not understand this, that is fine; you can skip this extra info part. It is only here to show how this model works mathematically.**

The Naive Bayes classifier is based on Bayes' Theorem, which is used to calculate the probability of a class given certain features. The formula for Naive Bayes is:

$$P(C_k \mid x_1, x_2, \ldots, x_n) = \frac{P(C_k) \cdot P(x_1, x_2, \ldots, x_n \mid C_k)}{P(x_1, x_2, \ldots, x_n)}$$

Here’s what each term means:

- **$P(C_k \mid x_1, x_2, \ldots, x_n)$**: The probability of class $C_k$ given the features $x_1, x_2, \ldots, x_n$. This is what we want to compute.
- **$P(C_k)$**: The prior probability of class $C_k$. This is how likely it is to be in class $C_k$ before considering any features.
- **$P(x_1, x_2, \ldots, x_n \mid C_k)$**: The likelihood of the features $x_1, x_2, \ldots, x_n$ given class $C_k$. This is how likely the features are if we know the class is $C_k$.
- **$P(x_1, x_2, \ldots, x_n)$**: The total probability of the features $x_1, x_2, \ldots, x_n$ across all classes. This acts as a normalizing factor.

**Naive Assumption**: The “naive” part of Naive Bayes assumes that the features are independent given the class. So, instead of calculating the joint probability of all features, it simplifies to:

$$P(x_1, x_2, \ldots, x_n \mid C_k) = P(x_1 \mid C_k) \cdot P(x_2 \mid C_k) \cdot \ldots \cdot P(x_n \mid C_k)$$

This simplifies the computation, making the Naive Bayes classifier efficient even with many features.
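The formulas above can be followed step by step with plain numbers. In this minimal sketch every probability is made up for illustration (it continues the hobbies example and has nothing to do with the autism dataset): multiply the prior by each feature's likelihood, then divide by the total so the results add up to 1.

```python
# Made-up priors P(C_k) for two classes.
priors = {"sports": 0.6, "no_sports": 0.4}

# Made-up likelihoods P(x_i | C_k) for two observed hobbies.
likelihoods = {
    "sports": {"reads": 0.5, "plays_soccer": 0.8},
    "no_sports": {"reads": 0.7, "plays_soccer": 0.1},
}
observed = ["reads", "plays_soccer"]

# Numerator of Bayes' theorem under the naive assumption:
# P(C_k) * P(x_1 | C_k) * P(x_2 | C_k)
scores = {}
for cls, prior in priors.items():
    score = prior
    for feature in observed:
        score *= likelihoods[cls][feature]
    scores[cls] = score

# The denominator P(x_1, x_2) is the sum of the numerators over all classes.
evidence = sum(scores.values())
posteriors = {cls: score / evidence for cls, score in scores.items()}

best = max(posteriors, key=posteriors.get)
print(posteriors, best)
```

Because the denominator is the same for every class, comparing the numerators alone is enough to pick the most likely class.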

-----

The Naive Bayes algorithm is based on applying Bayes' theorem with the assumption that features are independent given the class.
I am using the `GaussianNB` version here, which assumes that each feature follows a normal distribution within each class.

In [81]:
nb = GaussianNB()
nb.fit(X_train, y_train)

y_nb = nb.predict(X_test)

print(classification_report(y_test, y_nb))

              precision    recall  f1-score   support

         0.0       1.00      0.99      0.99        87
         1.0       0.97      1.00      0.99        35

    accuracy                           0.99       122
   macro avg       0.99      0.99      0.99       122
weighted avg       0.99      0.99      0.99       122



-----

### 4. **Logistic Regression**
Think of logistic regression as a way to predict whether someone will like a particular type of movie (yes or no) based on their preferences.

**How it works:** It looks at different features (like how much someone likes action or comedy) and calculates the chance of them liking the movie. It’s simple and helps us understand the relationship between features and the outcome.
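As a rough sketch of the calculation, logistic regression adds up weighted features and passes the sum through the sigmoid function to get a probability between 0 and 1. The weights, the bias and the feature names below are made up for illustration; a real model learns them from the training data.

```python
import math

def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights for the movie example.
weights = {"likes_action": 1.2, "likes_comedy": -0.4}
bias = -0.5

def predict_probability(features):
    """Weighted sum of the features plus bias, turned into a probability."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return sigmoid(z)

p = predict_probability({"likes_action": 1.0, "likes_comedy": 0.0})
label = "yes" if p >= 0.5 else "no"   # 0.5 is the usual decision threshold
print(round(p, 3), label)
```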

------

**Even though the name has "regression" in it, we are using it as a classification model.**

Logistic regression is a linear model that estimates the probability that a given input belongs to a particular class.
I am training the logistic regression model with `max_iter=500` (the default is 100) to give the solver enough iterations to converge.

In [77]:
log_reg = LogisticRegression(max_iter=500)
log_reg.fit(X_train, y_train)

y_log_reg = log_reg.predict(X_test)

print(classification_report(y_test, y_log_reg))

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        87
         1.0       1.00      1.00      1.00        35

    accuracy                           1.00       122
   macro avg       1.00      1.00      1.00       122
weighted avg       1.00      1.00      1.00       122



----

In summary:
- **KNN**: Guess based on the closest examples.
- **SVM**: Draw a line to separate groups.
- **Naive Bayes**: Use probabilities from each feature to make a prediction.
- **Logistic Regression**: Predict a yes or no outcome based on features.

----

## **How to understand classification report -**

The `classification_report` is a tool used to evaluate how well a classification model is performing. It provides several important metrics about the model’s performance. Here’s how to interpret each part of it:

### 1. **Precision**
- **What it is**: Precision measures how many of the items that the model classified as positive are actually positive.
- **How to interpret**: If precision is 1.00 (or 100%), it means that every time the model said something was a positive class (like "1.0"), it was correct. If it’s lower, it means there were some mistakes.

### 2. **Recall**
- **What it is**: Recall measures how many of the actual positive items were correctly identified by the model.
- **How to interpret**: If recall is 1.00 (or 100%), it means that the model found all the true positive items. If it’s lower, it means the model missed some positive items.

### 3. **F1-Score**
- **What it is**: The F1-Score combines precision and recall into a single score. It is their harmonic mean, 2 × (precision × recall) / (precision + recall), which stays low if either of the two is low.
- **How to interpret**: An F1-Score of 1.00 means the model is perfect at both precision and recall. Lower scores indicate that there are trade-offs between precision and recall.

### 4. **Support**
- **What it is**: Support is the number of actual occurrences of each class in the dataset.
- **How to interpret**: It tells you how many examples of each class were in the data being evaluated. In the reports above, the test set has 87 examples of class 0.0 and 35 examples of class 1.0.

### 5. **Accuracy**
- **What it is**: Accuracy measures how many predictions the model got right overall.
- **How to interpret**: An accuracy of 1.00 (or 100%) means the model got all the predictions correct.

### 6. **Macro Average**
- **What it is**: Macro average calculates the precision, recall, and F1-Score by averaging them across all classes, treating each class equally.
- **How to interpret**: This gives you an idea of the model’s performance across all classes without being affected by the number of examples in each class.

### 7. **Weighted Average**
- **What it is**: Weighted average calculates the precision, recall, and F1-Score by taking into account the number of examples in each class.
- **How to interpret**: This provides an overall view of performance, considering how many examples are in each class.
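To connect these definitions to numbers, here is a minimal sketch that computes the metrics by hand for class 1 and compares them with sklearn's functions. The labels `y_true` and `y_pred` are made up for illustration and are not taken from the models above.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up true labels and predictions (6 of class 0, 4 of class 1).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Count the outcomes for the positive class (1).
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # correctly found positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false alarms
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # missed positives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(precision, recall, round(f1, 3), accuracy)

# Macro average treats both classes equally; weighted average uses the support.
macro_precision = precision_score(y_true, y_pred, average="macro")
weighted_precision = precision_score(y_true, y_pred, average="weighted")
print(macro_precision, weighted_precision)
```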

----

Usually we do not get 100% accuracy from a model unless the dataset is very simple or there is a very clear relationship between the features and the thing to be predicted. Here, though, I specifically chose an easy dataset so that we could get close to 100%, and in one case we did.

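I was very surprised to see logistic regression get 100% accuracy. It is a simple linear model, and on harder datasets it is often outperformed by more flexible models.
In this case it suits the dataset well. A likely reason is the "result" column: in the rows shown above it equals the sum of the ten A-scores, and the YES rows have the highest results, so the label may be almost directly readable from that one feature.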

Logistic regression got the highest accuracy (1.00), followed by Naive Bayes (0.99), SVM (0.96) and KNN (0.92).
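When a model scores this high, it is worth checking whether one feature nearly gives away the label. The minimal sketch below rebuilds only the five rows printed by the first `df.head()` (ids 0 to 4) and checks whether "result" equals the sum of the ten A-scores. Five rows cannot prove anything about the full dataset, so treat this as a pattern for a check you could run on the real `df`, not as a finding.

```python
import pandas as pd

score_cols = [f"A{i}_Score" for i in range(1, 11)]

# The five rows shown by the first df.head(): ten A-scores, result, Class/ASD.
rows = [
    [1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 6, "NO"],
    [1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 5, "NO"],
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 8, "YES"],
    [1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 6, "NO"],
    [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, "NO"],
]
sample = pd.DataFrame(rows, columns=score_cols + ["result", "Class/ASD"])

# Does "result" equal the row-wise sum of the A-scores?
matches = (sample[score_cols].sum(axis=1) == sample["result"]).all()
print(matches)

# Range of "result" within each class.
print(sample.groupby("Class/ASD")["result"].agg(["min", "max"]))
```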

You can further tweak these models by changing their hyperparameters and trying different values. I went with the defaults, other than `max_iter=500` in logistic regression.

----

**Hyperparameters** are special settings that you choose before training a machine learning model. They help control how the model learns from the data and can affect how well it performs.

**Think of it like this:**
- **Recipe**: When you cook, you follow a recipe. The ingredients and amounts you use are like hyperparameters.
- **Adjusting the Recipe**: Just as you might change the amount of sugar or cook time to make the dish taste better, you adjust hyperparameters to make the model perform better.

**Examples:**
- **Number of Neighbors (K)**: In K-Nearest Neighbors, you choose how many nearby points to look at to make a decision.
- **Learning Rate**: In some models, this decides how quickly the model learns from the data.

**Why They Matter:**
- Choosing the right settings helps the model learn more accurately and make better predictions.
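Here is a minimal sketch of trying different values for one hyperparameter, the number of neighbors in KNN. The data comes from sklearn's `make_classification`, which generates a synthetic dataset, so the scores say nothing about the autism dataset; the values 1, 5 and 15 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only so the example runs on its own.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Train one model per hyperparameter value and record its accuracy.
scores = {}
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    scores[k] = model.score(X_te, y_te)

print(scores)
```

In practice it is better to compare hyperparameter values on a separate validation set or with cross-validation, so that the test set stays untouched for the final evaluation.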