# Train-Test-Split

**What is train_test_split in Machine Learning?**

In Scikit-learn, train_test_split is a function that splits a dataset into training and testing subsets, which are then used to measure a machine learning model's performance.





**Why Use Train Test Split in Machine Learning?**

In machine learning, we often build or train models on a single dataset. To evaluate whether a model performs as expected, we train it on one portion of the dataset and then compare how accurately its predictions on the remaining, unseen portion match the real-world values.

To evaluate the accuracy of machine learning models, data scientists split datasets into two portions:

- Training set (used to train the model)

- Testing set (used to test the model)

**How to Use Train Test Split?**

- Split a dataset into a training and testing set

- Provide the testing size with the test_size parameter

- Train a model on the training set

- Make predictions on the testing set

- Compute the accuracy with a metric such as accuracy (accuracy_score in Scikit-learn)
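
The steps above can be sketched end to end as follows. This is a minimal illustration, not the housing workflow used later: the synthetic data from `make_classification`, the choice of `LogisticRegression`, and the names `X_demo`, `y_demo`, `clf`, and `acc` are all assumptions made for the example. Predictions are made on the held-out testing set, since that is the portion the model has not seen.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the housing dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Split the data, holding out 20% of the rows via test_size.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Train on the training set only.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

# Predict on the held-out testing set.
y_pred = clf.predict(X_te)

# Compare the predictions with the true test labels.
acc = accuracy_score(y_te, y_pred)
print(f"Test accuracy: {acc:.3f}")
```

Accuracy only applies to classification. For a regression target, a metric such as `mean_squared_error` or `r2_score` from `sklearn.metrics` would take its place.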

In [1]:
import pandas as pd

In [3]:
housing = pd.read_csv('/content/housing.csv')
housing

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
0,-122.23,37.88,41,880,129.0,322,126,8.3252,NEAR BAY,452600
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,NEAR BAY,358500
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,NEAR BAY,352100
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,NEAR BAY,341300
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,NEAR BAY,342200
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25,1665,374.0,845,330,1.5603,INLAND,78100
20636,-121.21,39.49,18,697,150.0,356,114,2.5568,INLAND,77100
20637,-121.22,39.43,17,2254,485.0,1007,433,1.7000,INLAND,92300
20638,-121.32,39.43,18,1860,409.0,741,349,1.8672,INLAND,84700


In [4]:
housing.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
0,-122.23,37.88,41,880,129.0,322,126,8.3252,NEAR BAY,452600
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,NEAR BAY,358500
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,NEAR BAY,352100
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,NEAR BAY,341300
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,NEAR BAY,342200


In [5]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  int64  
 3   total_rooms         20640 non-null  int64  
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  int64  
 6   households          20640 non-null  int64  
 7   median_income       20640 non-null  float64
 8   ocean_proximity     20640 non-null  object 
 9   median_house_value  20640 non-null  int64  
dtypes: float64(4), int64(5), object(1)
memory usage: 1.6+ MB


In [6]:
y = housing.median_income

In [7]:
X = housing.drop('median_income', axis = 1)

In [8]:
X

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,ocean_proximity,median_house_value
0,-122.23,37.88,41,880,129.0,322,126,NEAR BAY,452600
1,-122.22,37.86,21,7099,1106.0,2401,1138,NEAR BAY,358500
2,-122.24,37.85,52,1467,190.0,496,177,NEAR BAY,352100
3,-122.25,37.85,52,1274,235.0,558,219,NEAR BAY,341300
4,-122.25,37.85,52,1627,280.0,565,259,NEAR BAY,342200
...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25,1665,374.0,845,330,INLAND,78100
20636,-121.21,39.49,18,697,150.0,356,114,INLAND,77100
20637,-121.22,39.43,17,2254,485.0,1007,433,INLAND,92300
20638,-121.32,39.43,18,1860,409.0,741,349,INLAND,84700


In [9]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; the remaining 80% is used for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [11]:
# Print the shapes of the full dataset and of the training and testing sets

print(housing.shape)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(20640, 10)
(16512, 9)
(4128, 9)
(16512,)
(4128,)
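
With test_size = 0.2, the 20640 rows are divided into 4128 test rows and 16512 training rows. Because the rows are shuffled before splitting, a different run may place different rows in each set. The sketch below uses placeholder arrays with the same row count (an assumption, standing in for the housing data) to show that passing `random_state` makes the split repeatable. The value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the housing data.
n_rows = 20640
X_demo = np.arange(n_rows * 2).reshape(n_rows, 2)
y_demo = np.arange(n_rows)

# The same random_state produces the same shuffle, and so the same split.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)            # (16512, 2) (4128, 2)
print(np.array_equal(X_te, split_b[1]))  # True
```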


# Underfitting and Overfitting

**Diagnosing Underfitting:**


Underfitting occurs when a model is too simplistic to capture the underlying patterns in the data. It performs poorly on both the training and test sets. Signs of underfitting include:

- Low training and test performance (low accuracy, high error).

- Consistently poor performance across different datasets or folds in cross-validation.

- Model doesn't seem to learn from the training data.
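
These signs can be checked by scoring the same model on both sets. The sketch below is an illustration on synthetic data (an assumption, not the housing dataset): a straight line is fitted to a quadratic relationship, so the model is too simple by construction and the R^2 score should be low on the training set as well as the test set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic quadratic data: y depends on x squared, plus some noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(400, 1))
y_quad = x[:, 0] ** 2 + rng.normal(scale=0.3, size=400)

x_tr, x_te, y_tr, y_te = train_test_split(x, y_quad, test_size=0.2, random_state=0)

# A straight line cannot follow the curve, so it underfits.
line = LinearRegression().fit(x_tr, y_tr)
train_r2 = line.score(x_tr, y_tr)
test_r2 = line.score(x_te, y_te)
print(f"train R^2 = {train_r2:.2f}, test R^2 = {test_r2:.2f}")
```

Low scores on both sets point to underfitting. A high training score with a low test score would point to overfitting instead.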


**Addressing Underfitting:**

- **Increase Model Complexity:** Consider using a more complex model with more parameters, such as a deeper neural network, higher-degree polynomial regression, or a more expressive algorithm.

- **Feature Engineering**: Add more relevant features to the dataset to provide the model with more information.

- **Fine-tuning Hyperparameters:** Adjust hyperparameters like learning rate, regularization strength, or the number of hidden units/layers in a neural network.

- **Reduce Regularization:** If you're using regularization techniques, consider reducing the strength of regularization or using a different type.
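
As a sketch of the first remedy, the example below raises model complexity by adding polynomial features in front of a linear regression. The synthetic quadratic data and the names `simple` and `richer` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(400, 1))
y_quad = x[:, 0] ** 2 + rng.normal(scale=0.3, size=400)
x_tr, x_te, y_tr, y_te = train_test_split(x, y_quad, test_size=0.2, random_state=0)

# Too simple: a straight line.
simple = LinearRegression().fit(x_tr, y_tr)

# More complex: the same linear model on degree-2 polynomial features.
richer = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
richer.fit(x_tr, y_tr)

simple_r2 = simple.score(x_te, y_te)
richer_r2 = richer.score(x_te, y_te)
print(f"test R^2: linear = {simple_r2:.2f}, degree 2 = {richer_r2:.2f}")
```

Here the added complexity matches the true shape of the data. Raising the degree much further would start to risk overfitting.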


**Reasons for Underfitting:**

- High bias and low variance.

- The training dataset is too small.

- The model is too simple.

- The training data has not been cleaned and contains noise.


**Techniques to Reduce Underfitting:**

- Increase model complexity.

- Increase the number of features by performing feature engineering.

- Remove noise from the data.

- Increase the number of epochs or increase the duration of training to get better results.


**Diagnosing Overfitting:**

Overfitting occurs when a model is too flexible and fits the noise and outliers in the training data rather than the underlying pattern. It performs very well on the training set but poorly on the test set. Signs of overfitting include:

- High training performance but significantly lower test performance.

- Large differences between training and test performance.

- Model captures noise and fluctuations in the training data.
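
The gap between training and test performance can be measured directly. In the sketch below, an unrestricted decision tree memorizes noisy synthetic data (an assumption made for illustration), so its training score is essentially perfect while its test score is clearly lower.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic data: a sine curve plus substantial random noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(300, 1))
y_noisy = np.sin(x[:, 0]) + rng.normal(scale=0.5, size=300)
x_tr, x_te, y_tr, y_te = train_test_split(x, y_noisy, test_size=0.2, random_state=0)

# With no depth limit, the tree grows until it reproduces every training point.
deep_tree = DecisionTreeRegressor(random_state=0).fit(x_tr, y_tr)

train_r2 = deep_tree.score(x_tr, y_tr)
test_r2 = deep_tree.score(x_te, y_te)
gap = train_r2 - test_r2
print(f"train R^2 = {train_r2:.2f}, test R^2 = {test_r2:.2f}, gap = {gap:.2f}")
```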


**Addressing Overfitting:**

- **Regularization:** Apply regularization techniques to penalize overly complex models. Common methods include L1 regularization (Lasso), L2 regularization (Ridge), and dropout in neural networks.

- **Feature Selection:** Remove irrelevant or noisy features that might be contributing to overfitting.

- **More Data:** Increase the size of your training dataset to provide the model with more examples to learn from.

- **Early Stopping:** Monitor the performance on the validation set during training and stop training when performance starts to degrade.

- **Simpler Model:** Consider using a simpler model architecture with fewer parameters.

- **Ensemble Methods:** Combine predictions from multiple models to reduce overfitting.
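
As a sketch of the regularization remedy, the example below fits the same deliberately excessive polynomial model twice, once with plain least squares and once with Ridge (L2) regularization. The synthetic data, the degree of 12, and alpha = 1.0 are assumptions chosen for illustration, not tuned values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# A small, noisy synthetic sample (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(60, 1))
y_noisy = np.sin(x[:, 0]) + rng.normal(scale=0.5, size=60)
x_tr, x_te, y_tr, y_te = train_test_split(x, y_noisy, test_size=0.2, random_state=0)


def poly_model(regressor):
    # Degree-12 features are far more flexible than this sample needs.
    return make_pipeline(
        PolynomialFeatures(degree=12, include_bias=False),
        StandardScaler(),
        regressor,
    )


plain = poly_model(LinearRegression()).fit(x_tr, y_tr)
ridge = poly_model(Ridge(alpha=1.0)).fit(x_tr, y_tr)

# The L2 penalty shrinks the coefficients of the final regression step.
plain_norm = np.linalg.norm(plain.steps[-1][1].coef_)
ridge_norm = np.linalg.norm(ridge.steps[-1][1].coef_)

for name, model in [("plain", plain), ("ridge", ridge)]:
    print(name, "train R^2:", round(model.score(x_tr, y_tr), 3),
          "test R^2:", round(model.score(x_te, y_te), 3))
print("coefficient norm:", plain_norm, "(plain) vs", ridge_norm, "(ridge)")
```

Ridge gives up a little training fit in exchange for smaller coefficients. Whether the test score improves depends on the data and on alpha, which is normally chosen with a validation set or cross-validation.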


**Reasons for Overfitting:**

- High variance and low bias.

- The model is too complex.

- The training dataset is too small.

**Techniques to Reduce Overfitting:**

- Increase training data.

- Reduce model complexity.

- Early stopping during the training phase (keep an eye on the validation loss during training, and stop as soon as it begins to increase).

- Ridge Regularization and Lasso Regularization.

- Use dropout for neural networks to tackle overfitting.
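
Early stopping can be sketched with a plain gradient-descent loop that watches the loss on a held-out validation set. Everything in this example is an assumption made for illustration: the synthetic data, the degree-15 polynomial features, the learning rate, and the `patience` rule, which stops training once the validation loss has gone 50 epochs without improving.

```python
import numpy as np

# Synthetic data: a sine curve plus noise (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=80)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=80)

# Degree-15 polynomial features: flexible enough to overfit 60 points.
features = np.vander(x, 16, increasing=True)
F_tr, y_tr = features[:60], y[:60]      # training portion
F_val, y_val = features[60:], y[60:]    # validation portion


def mse(F, target, w):
    """Mean squared error of the linear model F @ w."""
    return float(np.mean((F @ w - target) ** 2))


w = np.zeros(F_tr.shape[1])
best_w, best_val, best_epoch = w.copy(), mse(F_val, y_val, w), 0
patience, max_epochs, lr = 50, 5000, 0.05

for epoch in range(1, max_epochs + 1):
    # One gradient-descent step on the training loss.
    grad = 2 * F_tr.T @ (F_tr @ w - y_tr) / len(y_tr)
    w -= lr * grad

    # Track the best validation loss seen so far.
    val = mse(F_val, y_val, w)
    if val < best_val:
        best_w, best_val, best_epoch = w.copy(), val, epoch
    elif epoch - best_epoch >= patience:
        break  # no improvement for `patience` epochs: stop training

stopped_epoch = epoch
print(f"stopped at epoch {stopped_epoch}, best epoch {best_epoch}, "
      f"best validation MSE {best_val:.3f}")
```

The weights kept are `best_w`, taken from the epoch with the lowest validation loss, rather than the weights from the final step. If the validation loss keeps improving, the loop simply runs to `max_epochs`.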

