# Data Splitting in Machine Learning

## Splitting Data into Features (X) and Target (y) in Machine Learning

##### Why Split Data into X and y?
- Separating features from the target is one of the most fundamental parts of model building, especially in supervised machine learning, because the model learns from data by understanding the relationship between the input data (features) and the output data (target).
- Feature (X): The input variables (what you use to make predictions)
- Target (y): The output variable (what you want to predict)
- Syntax: 
```
X = df[['input feature 1', 'input feature 2', ....]]
y = df['output feature (target)']
```

## Ways to Split the Data

##### 1. .drop() and column selection
- Syntax: 
```
X = df.drop("Target Feature", axis=1)
y = df["Target Feature"]
```

- drop() removes the target column from the features
- axis=1 means drop a column, not a row
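As a small illustration of the two points above, here is a minimal sketch on a made-up DataFrame. The column names `age`, `income`, and `bought` are hypothetical and do not come from any dataset in these notes. It shows that `drop()` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [40000, 52000, 88000],
    "bought": ["No", "Yes", "Yes"],
})

X = df.drop("bought", axis=1)  # axis=1 -> drop a column, not a row
y = df["bought"]               # the target as a pandas Series

print(list(X.columns))   # feature columns only
print(list(df.columns))  # the original DataFrame still contains the target
```

`df.drop(columns="bought")` is an equivalent spelling that avoids the `axis` argument.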


##### 2. Using .iloc (index-based)
- Syntax:
```
X = df.iloc[ :, :-1]
y = df.iloc[ :, -1]
```

- :-1 excludes the last column → feature set
- -1 gets the last column → target
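Because `.iloc` selects by position, this approach only works when the target really is the last column. The sketch below assumes a made-up DataFrame (hypothetical column names) in which the target comes first, and reorders the columns before the positional split.

```python
import pandas as pd

# Hypothetical toy data with the target deliberately placed first
df = pd.DataFrame({
    "bought": ["No", "Yes", "Yes"],
    "age": [25, 32, 47],
    "income": [40000, 52000, 88000],
})

# Move the target to the end so the positional split is valid
target = "bought"
df = df[[c for c in df.columns if c != target] + [target]]

X = df.iloc[:, :-1]  # every column except the last
y = df.iloc[:, -1]   # the last column only

print(list(X.columns))
print(y.name)
```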


##### 3. Select specific columns manually
- Syntax: 
```
X = df[['input feature 1', 'input feature 2', ....]]
y = df['output feature (target)']
```
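When the target is the last column, the three approaches above should give the same feature set. A quick check on a made-up DataFrame (the column names are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy data; the target "bought" is the last column
df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [40000, 52000, 88000],
    "bought": ["No", "Yes", "Yes"],
})

X_drop = df.drop("bought", axis=1)   # 1. drop the target column
X_iloc = df.iloc[:, :-1]             # 2. positional selection
X_manual = df[["age", "income"]]     # 3. explicit column list

# DataFrame.equals compares shape, labels, dtypes, and values
same = X_drop.equals(X_iloc) and X_iloc.equals(X_manual)
print(same)
```

Which one to use is mostly a matter of robustness: `drop()` and the explicit column list keep working if the column order changes, while `.iloc` does not.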

### Example

In [1]:
# import Libraries
import pandas as pd

In [2]:
ds = pd.read_csv('Sales_data.csv')
ds.head()

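| Unnamed: 0 | Group | Customer_Segment | Sales_Before | Sales_After | Customer_Satisfaction_Before | Customer_Satisfaction_After | Purchase_Made |
| ---------- | --------- | ---------------- | ------------ | ----------- | ---------------------------- | --------------------------- | ------------- |
| 0 | Control | High Value | 240.548359 | 300.007568 | 74.684767 | NaN | No |
| 1 | Treatment | High Value | 246.862114 | 381.337555 | 100.0 | 100.0 | Yes |
| 2 | Control | High Value | 156.978084 | 179.330464 | 98.780735 | 100.0 | No |
| 3 | Control | Medium Value | 192.126708 | 229.278031 | 49.333766 | 39.811841 | Yes |
| 4 | NaN | High Value | 229.685623 | NaN | 83.974852 | 87.738591 | Yes |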


In [None]:
# split the data
X = ds.drop('Purchase_Made', axis = 1)
Y = ds['Purchase_Made']
print(X)
print(Y)

          Group Customer_Segment  Sales_Before  Sales_After  \
0       Control       High Value    240.548359   300.007568   
1     Treatment       High Value    246.862114   381.337555   
2       Control       High Value    156.978084   179.330464   
3       Control     Medium Value    192.126708   229.278031   
4           NaN       High Value    229.685623          NaN   
...         ...              ...           ...          ...   
9995  Treatment              NaN    259.695935   415.181694   
9996    Control       High Value    186.488285   216.225457   
9997  Treatment        Low Value    208.107142   322.893351   
9998  Treatment     Medium Value           NaN   431.974901   
9999    Control        Low Value           NaN   124.402398   

      Customer_Satisfaction_Before  Customer_Satisfaction_After  
0                        74.684767                          NaN  
1                       100.000000                   100.000000  
2                        98.780735           

---

# Train-Test Data Split

- The train-test split is the process of dividing your dataset into two parts:
  - Training Set: Used to train the model (learn patterns)
  - Test Set: Used to evaluate model performance on unseen data

### Typical Split Ratio
- Training Set: 70% - 80%
- Test Set: 20% - 30%

##### Importing Train-Test Split
```
from sklearn.model_selection import train_test_split
```

##### Syntax
```
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,           # 20% for testing
    random_state=42,         # Reproducibility
    shuffle=True             # Shuffle before split
)
```

### Parameters

| Parameter      | Description                                      |
| -------------- | ------------------------------------------------ |
| `X, y`         | Features and labels                              |
| `test_size`    | Fraction of data for testing (e.g., 0.2 = 20%)   |
| `random_state` | Ensures same split every run                     |
| `shuffle`      | Shuffles data before splitting                   |
| `stratify`     | Preserves class proportions (for classification) |


### Output Structure
| Variable  | Contains                  |
| --------- | ------------------------- |
| `X_train` | Feature data for training |
| `X_test`  | Feature data for testing  |
| `y_train` | Labels for training       |
| `y_test`  | Labels for testing        |
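To make the four outputs concrete, the sketch below splits a made-up ten-row DataFrame (hypothetical column names) and checks that the shapes add up and that each feature row stays paired with its label.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the column names are illustrative only
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 60, 44, 35, 41],
    "income": [40, 52, 88, 91, 60, 45, 120, 75, 58, 66],
    "bought": ["No", "Yes", "Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"],
})
X = df.drop("bought", axis=1)
y = df["bought"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # rows are divided between the two sets
print(y_train.shape, y_test.shape)

# Features and labels keep the same row index, so they stay aligned
print(X_train.index.equals(y_train.index))
print(X_test.index.equals(y_test.index))
```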

### Detailed Explanation of train_test_split() Parameters

#####  1. X, y → Features and Labels
- These are the inputs to the function.
- X: The input features (usually a DataFrame or NumPy array).
- y: The target/output variable (usually a Series or 1D array).
- Example: 
```
X = df.drop("Outcome", axis=1)  # all input columns
y = df["Outcome"]               # target column
```

---
#####  2. test_size → Size of Test Set
- Defines the proportion of the dataset to include in the test split.
- It can be:
  - A float between 0.0 and 1.0 → fraction of the dataset (most common)
  - An int → absolute number of test samples
- Example: 
```
train_test_split(X, y, test_size=0.2)  # 20% test, 80% train
train_test_split(X, y, test_size=100)  # 100 samples in test set
```

- Common values:
  - 0.2 → 20% test
  - 0.3 → 30% test
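The float and int forms of `test_size` can be compared directly. This is a minimal sketch on synthetic NumPy data (500 samples, chosen here only so that 20% and 100 rows coincide):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 500 samples with 2 features each
X = np.arange(1000).reshape(500, 2)
y = np.arange(500)

# Float -> fraction of the dataset
_, X_test_frac, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)

# Int -> absolute number of test samples
_, X_test_abs, _, _ = train_test_split(X, y, test_size=100, random_state=0)

print(len(X_test_frac))  # size of the 20% test set
print(len(X_test_abs))   # size of the 100-sample test set
```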

---
#####  3. random_state → Reproducibility
- Controls the shuffling of data before the split.
- Setting this to a fixed number makes the split repeatable every time you run the code.
- If not set, you may get a different train-test split on each run.
- Example: 
```
train_test_split(X, y, test_size=0.3, random_state=42)
```
- Without a fixed random_state, the rows assigned to train and test change on every run, so the model's measured performance can differ each time.
- Use a fixed random_state (like 0, 42, or any number) for consistency in experiments.
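The effect of `random_state` can be checked with a short sketch on synthetic data. The seeds 42 and 7 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples with 2 features each
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Same seed twice -> identical test sets
_, _, _, y_test_a = train_test_split(X, y, test_size=0.3, random_state=42)
_, _, _, y_test_b = train_test_split(X, y, test_size=0.3, random_state=42)

# A different seed will generally select different rows
_, _, _, y_test_c = train_test_split(X, y, test_size=0.3, random_state=7)

print(np.array_equal(y_test_a, y_test_b))  # same seed: splits match
print(np.array_equal(y_test_a, y_test_c))  # different seed: usually no match
```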

---

#####  4. shuffle → Whether to Shuffle the Data Before Splitting
- True (default): Data is shuffled before splitting.
- False: No shuffling – useful in time series or ordered data.
- Example: 
```
train_test_split(X, y, test_size=0.2, shuffle=True)
```
- Use shuffle=False if:
  - You’re dealing with time-dependent sequences (e.g., stock prices)
  - Data order has meaning
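With `shuffle=False` the split is a simple cut: the first rows go to training and the last rows go to testing. A minimal sketch on ten synthetic, time-ordered observations:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten time-ordered observations (0 is the oldest, 9 the newest)
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print(y_train)  # the earliest 80% of the rows, in original order
print(y_test)   # the latest 20% of the rows
```

Note that `stratify` cannot be combined with `shuffle=False`; scikit-learn raises a `ValueError` in that case.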

---

##### 5. stratify → Stratified Sampling Based on Labels
- Ensures class proportions are preserved in both train and test sets (important for classification problems with imbalanced classes).
- If you have 80% class A and 20% class B in y, setting stratify=y will ensure the train and test sets also maintain this ratio.
- Example:
```
train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
```
- If stratify is not used on an imbalanced dataset, a random split can leave the minority class under-represented in the train or test set, so the model may train mostly on one class and fail to generalize.
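The 80% / 20% scenario described above can be reproduced with synthetic labels. This sketch assumes 100 samples and counts the classes on each side of the split:

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 80 samples of class "A", 20 of class "B"
y = np.array(["A"] * 80 + ["B"] * 20)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Both sets keep the original 80/20 class ratio
print(Counter(y_train.tolist()))
print(Counter(y_test.tolist()))
```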

### Example

In [4]:
# Import Libraries
import pandas as pd
from sklearn.model_selection import train_test_split

In [5]:
# load data set
ds = pd.read_csv('Sales_data.csv')
ds.head()

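| Unnamed: 0 | Group | Customer_Segment | Sales_Before | Sales_After | Customer_Satisfaction_Before | Customer_Satisfaction_After | Purchase_Made |
| ---------- | --------- | ---------------- | ------------ | ----------- | ---------------------------- | --------------------------- | ------------- |
| 0 | Control | High Value | 240.548359 | 300.007568 | 74.684767 | NaN | No |
| 1 | Treatment | High Value | 246.862114 | 381.337555 | 100.0 | 100.0 | Yes |
| 2 | Control | High Value | 156.978084 | 179.330464 | 98.780735 | 100.0 | No |
| 3 | Control | Medium Value | 192.126708 | 229.278031 | 49.333766 | 39.811841 | Yes |
| 4 | NaN | High Value | 229.685623 | NaN | 83.974852 | 87.738591 | Yes |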


In [8]:
# Split the data into X, y
X = ds.drop('Purchase_Made', axis= 1)
y = ds['Purchase_Made']
X, y

(          Group Customer_Segment  Sales_Before  Sales_After  \
 0       Control       High Value    240.548359   300.007568   
 1     Treatment       High Value    246.862114   381.337555   
 2       Control       High Value    156.978084   179.330464   
 3       Control     Medium Value    192.126708   229.278031   
 4           NaN       High Value    229.685623          NaN   
 ...         ...              ...           ...          ...   
 9995  Treatment              NaN    259.695935   415.181694   
 9996    Control       High Value    186.488285   216.225457   
 9997  Treatment        Low Value    208.107142   322.893351   
 9998  Treatment     Medium Value           NaN   431.974901   
 9999    Control        Low Value           NaN   124.402398   
 
       Customer_Satisfaction_Before  Customer_Satisfaction_After  
 0                        74.684767                          NaN  
 1                       100.000000                   100.000000  
 2                        98.

In [10]:
# split the dataset in train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y,  test_size=0.30, random_state=42, shuffle=True)

# Print the data
X_train, X_test, y_train, y_test

(          Group Customer_Segment  Sales_Before  Sales_After  \
 9069    Control              NaN    291.106024   339.189101   
 2603    Control        Low Value    116.073014   134.793027   
 7738  Treatment     Medium Value    231.294717          NaN   
 1579        NaN              NaN    283.903517   338.551699   
 5058    Control     Medium Value    167.653035   201.148619   
 ...         ...              ...           ...          ...   
 5734  Treatment     Medium Value    211.298523   329.946118   
 5191  Treatment       High Value    171.580545   262.206431   
 5390  Treatment        Low Value    177.647075   282.720969   
 860   Treatment     Medium Value    189.962855   277.573511   
 7270  Treatment        Low Value    116.952862   189.670308   
 
       Customer_Satisfaction_Before  Customer_Satisfaction_After  
 9069                     68.671932                    63.550297  
 2603                     64.859218                    72.553231  
 7738                     69.