# Data Division

In machine learning, data division (also called data splitting) is the process of dividing a dataset into separate subsets so that a model can be trained on one part and evaluated on another. Evaluating on data the model has not seen exposes overfitting and gives a more honest estimate of how well the model generalizes.

# 🔹 Common Types of Data Division:
# Training Set

Used to train the machine learning model.

The model learns patterns, weights, and relationships from this data.

Usually makes up 60–80% of the total dataset.

# Test Set
Used to evaluate the final model's performance after training.

It should never be seen or used during training or hyperparameter tuning.

Usually 10–20% of the dataset.
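
The percentages above do not have to add up to 100%: a common practice is to hold out a third subset, a validation set, for hyperparameter tuning so that the test set stays untouched. The sketch below is an illustration, not part of the original notebook. It assumes synthetic data and hypothetical names (`X_rest`, `X_val`, `y_val`), and calls `train_test_split` twice to get a 60/20/20 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 rows, 2 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = rng.integers(0, 2, size=100)

# Step 1: set aside 20% as the test set; it is not touched again until the end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: split the remaining 80 rows into train and validation.
# 0.25 of 80 rows is 20 rows, i.e. 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(X_train.shape, X_val.shape, X_test.shape)
```

Hyperparameters would then be compared on `(X_val, y_val)`, and `(X_test, y_test)` used once for the final score.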



In [1]:
# To train a model, we give it inputs and the matching targets.

# Data ----> independent data (x), dependent data (y)

# x ----> x_training data, x_testing data
# y ----> y_training data, y_testing data

# Machine ----> for training ----> (x_training, y_training)
# 2+2=4
# 3+2=5

# Machine ----> for prediction ----> x_testing only; y_testing is kept aside to check the answer
# 6+5= ???
# (y_test) = 11
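
The addition analogy above can be tried out directly. This is a minimal sketch, assuming a `LinearRegression` model and a few extra made-up training pairs (two examples alone are not enough to pin down the relationship); neither choice comes from the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training inputs: pairs of numbers. Training targets: their sums.
x_training = np.array([[2, 2], [3, 2], [1, 5], [4, 1], [0, 3], [7, 2]])
y_training = x_training.sum(axis=1)

model = LinearRegression()
model.fit(x_training, y_training)

# Testing: the model sees only the input (6, 5).
# The known answer 11 is kept aside and used only for checking.
x_testing = np.array([[6, 5]])
y_testing = np.array([11])

prediction = model.predict(x_testing)
print("predicted:", prediction[0], "expected:", y_testing[0])
```

The prediction should land very close to 11, even though the pair (6, 5) never appeared in the training data.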

# 🔹 What is a Dataset?
A dataset is a collection of data used to train and test a machine learning model.
It usually has:

Input features (independent variables) → e.g., house size, number of bedrooms

Output/label (dependent variable) → e.g., price
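
As a small illustration of the two parts listed above, the sketch below builds a toy table and separates the features from the label. The column names and values are hypothetical, made up for this example only.

```python
import pandas as pd

# Hypothetical toy values, for illustration only.
houses = pd.DataFrame({
    "house_size": [850, 1200, 1500, 2000],
    "bedrooms": [2, 3, 3, 4],
    "price": [100000, 150000, 180000, 240000],
})

X = houses[["house_size", "bedrooms"]]  # input features (independent variables)
y = houses["price"]                     # label (dependent variable)

print(X.shape, y.shape)
```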



# 🔹 What is Data Splitting?
We divide the dataset into two parts:

Training Set → used to teach the model

Testing Set → used to check how well the model learned


# 🔹 Why Split the Data?
To train the model on one part (X_train, y_train)

To test accuracy on new unseen data (X_test, y_test)

To detect overfitting (when the model memorizes the training data but performs badly on new data)
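
The gap between training and testing accuracy is what makes overfitting visible. The sketch below is an illustration under stated assumptions: the data is synthetic with deliberately noisy labels, and an unrestricted `DecisionTreeClassifier` is used because it can memorize its training rows. Neither appears in the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): the label depends on two features,
# but about 20% of the labels are flipped at random to act as noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
flip = rng.random(500) < 0.2
y[flip] = 1 - y[flip]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unrestricted tree keeps splitting until it fits every training row.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print("train accuracy:", train_acc)
print("test accuracy :", test_acc)
```

A training accuracy well above the test accuracy suggests the model has memorized noise rather than learned the underlying pattern. Without a held-out test set, only the first number would be available.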



In [3]:
import numpy as np
import pandas as pd

In [4]:
df = pd.read_csv("placement.csv")

In [5]:
df.head() # shows the first 5 rows

   cgpa  resume_score  placed
0  8.14          6.52       1
1  6.17          5.17       0
2  8.27          8.86       1
3  6.88          7.27       1
4  7.52          7.30       1


In [6]:
df.shape #rows and columns

(100, 3)

In [7]:
df.isnull().sum()

cgpa            0
resume_score    0
placed          0
dtype: int64

# Dividing Dataset

In [8]:
x = df.drop(columns = ['placed']) #input columns
y = df['placed'] #target data

In [9]:
from sklearn.model_selection import train_test_split #helps us to divide data

In [10]:
x_train, x_test, y_train, y_test = train_test_split(x,
                                                   y,
                                                   test_size = 0.2,
                                                   random_state = 42)

In [11]:
print("Total DataFrame Shape :",df.shape)
print("-----------------------")
print("-----------------------")

print("Input data (x) shape : ",x.shape)
print("x_train data shape : ",x_train.shape)
print("x_test data shape : ",x_test.shape)
print("-----------------------")
print("Target data (y) shape : ",y.shape)
print("y_train data shape : ",y_train.shape)
print("y_test data shape : ",y_test.shape)
print("-----------------------")

Total DataFrame Shape : (100, 3)
-----------------------
-----------------------
Input data (x) shape :  (100, 2)
x_train data shape :  (80, 2)
x_test data shape :  (20, 2)
-----------------------
Target data (y) shape :  (100,)
y_train data shape :  (80,)
y_test data shape :  (20,)
-----------------------
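
Because `placed` is a class label, a purely random split does not guarantee that both subsets contain the classes in the same proportion. `train_test_split` accepts a `stratify` argument for this. The sketch below uses synthetic data with an assumed 70/30 class balance, not the actual `placement.csv` values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target (assumption): 70 ones and 30 zeros.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([1] * 70 + [0] * 30)

# stratify=y asks for the class proportions of y to be kept in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of the positive class in each subset.
print("full :", y.mean())
print("train:", y_train.mean())
print("test :", y_test.mean())
```

With stratification, the three shares should be nearly identical, which matters most when the dataset is small or one class is rare.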


# Data 2

In [12]:
df = pd.read_csv("covid_toy.csv")

In [13]:
df.isnull().sum()

age           0
gender        0
fever        10
cough         0
city          0
has_covid     0
dtype: int64

In [16]:
from sklearn.impute import SimpleImputer #for filling missing values.

In [17]:
si = SimpleImputer(strategy = 'mean') 

In [18]:
df['fever'] = si.fit_transform(df[['fever']]) # fill missing fever values with the mean of the observed values

In [19]:
df.isnull().sum()

age          0
gender       0
fever        0
cough        0
city         0
has_covid    0
dtype: int64

In [20]:
df.head(3)

   age gender  fever cough     city has_covid
0   60   Male  103.0  Mild  Kolkata        No
1   27   Male  100.0  Mild    Delhi       Yes
2   42   Male  101.0  Mild    Delhi        No


In [21]:
df.shape

(100, 6)

In [23]:
x = df.drop(columns=['has_covid'])
y = df['has_covid']

In [24]:
from sklearn.model_selection import train_test_split

In [25]:
x_train, x_test, y_train, y_test = train_test_split(x,
                                                   y,
                                                   test_size = 0.2,
                                                   random_state = 42)

In [26]:
print("Total DataFrame Shape :",df.shape)
print("-----------------------")
print("-----------------------")

print("Input data (x) shape : ",x.shape)
print("x_train data shape : ",x_train.shape)
print("x_test data shape : ",x_test.shape)
print("-----------------------")
print("Target data (y) shape : ",y.shape)
print("y_train data shape : ",y_train.shape)
print("y_test data shape : ",y_test.shape)
print("-----------------------")

Total DataFrame Shape : (100, 6)
-----------------------
-----------------------
Input data (x) shape :  (100, 5)
x_train data shape :  (80, 5)
x_test data shape :  (20, 5)
-----------------------
Target data (y) shape :  (100,)
y_train data shape :  (80,)
y_test data shape :  (20,)
-----------------------
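
In the cells above, the imputer computes the `fever` mean from all 100 rows before the split, so the test rows contribute to a value used during training. One possible alternative, sketched here as an assumption rather than the notebook's method, is to split first and fit the imputer on the training rows only. The data below is a synthetic stand-in for `covid_toy.csv`, and the names `data` and `target` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 100 rows, 10 missing fever values.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "age": rng.integers(5, 85, size=100),
    "fever": rng.integers(98, 105, size=100).astype(float),
})
data.loc[rng.choice(100, size=10, replace=False), "fever"] = np.nan
target = pd.Series(rng.integers(0, 2, size=100), name="has_covid")

# Split first, so the test rows play no part in computing the mean.
x_train, x_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=42
)
x_train = x_train.copy()
x_test = x_test.copy()

si = SimpleImputer(strategy="mean")
# fit_transform learns the mean from the training rows only ...
x_train["fever"] = si.fit_transform(x_train[["fever"]]).ravel()
# ... and transform reuses that same mean for the test rows.
x_test["fever"] = si.transform(x_test[["fever"]]).ravel()

print("learned mean:", si.statistics_[0])
print(x_train["fever"].isnull().sum(), x_test["fever"].isnull().sum())
```

The same fit-on-train, transform-on-test pattern applies to other preprocessing steps that learn values from the data.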
