# Notebook by-
## Himel Sarder
## Gmail : info.himelcse@gmail.com


### **Feature Splitting in Machine Learning & Data Processing**  

**Feature Splitting** is the process of breaking a single feature (column) into several new features. It is useful when one column packs multiple pieces of information, such as a full name, a date, or an address, that are easier to analyze and model once they are separated.  


## **Examples of Feature Splitting**
### **1️⃣ Splitting a Full Name into First Name and Last Name**
In a dataset, if we have a column called `Full Name`, we can split it into `First Name` and `Last Name` to enable better search functionality.

#### **Python Example:**
```python
import pandas as pd

# Sample dataset
data = {'Full Name': ['John Doe', 'Alice Smith', 'Michael Jordan']}
df = pd.DataFrame(data)

# Splitting Full Name into First and Last Name
# n must be passed as a keyword; pandas 2.x no longer accepts it positionally
df[['First Name', 'Last Name']] = df['Full Name'].str.split(' ', n=1, expand=True)

print(df)
```
🔹 **Result:**
| Full Name     | First Name | Last Name  |
|--------------|-----------|------------|
| John Doe     | John      | Doe        |
| Alice Smith  | Alice     | Smith      |
| Michael Jordan | Michael  | Jordan    |
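
Names with more than two parts behave differently depending on which space is used as the split point. The sketch below is only an illustration: the name `Mary Jane Watson` and the variable names are made up for this example. It compares `str.split` (splits at the first space) with `str.rsplit` (splits at the last space).

```python
import pandas as pd

# Hypothetical sample with a three-part name
names = pd.DataFrame({'Full Name': ['John Doe', 'Mary Jane Watson']})

# n=1 splits on the first space only, so everything after it lands in column 1
first_split = names['Full Name'].str.split(' ', n=1, expand=True)

# rsplit with n=1 splits on the last space instead
last_split = names['Full Name'].str.rsplit(' ', n=1, expand=True)
names['First Name'] = last_split[0]
names['Last Name'] = last_split[1]

print(first_split)
print(names)
```

Which variant is appropriate depends on the naming conventions in the data, so it is worth inspecting a few rows before choosing one.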


### **2️⃣ Splitting Date into Year, Month, and Day**
When working with a **Library Management System** or **Car Sales Website**, dates (such as publication date or car listing date) can be split into year, month, and day for better filtering and analysis.

#### **Python Example:**
```python
import pandas as pd

# Creating a DataFrame with a date column
df = pd.DataFrame({'Date': pd.to_datetime(['2023-05-15', '2018-10-20', '2021-07-30'])})

# Splitting Date into Year, Month, and Day
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

print(df)
```
🔹 **Result:**
| Date        | Year | Month | Day |
|------------|------|-------|-----|
| 2023-05-15 | 2023 | 5     | 15  |
| 2018-10-20 | 2018 | 10    | 20  |
| 2021-07-30 | 2021 | 7     | 30  |

✅ This helps in analyzing seasonal trends, filtering data by year/month, and improving visualizations.
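
For seasonal analysis, the `.dt` accessor exposes more than year, month, and day. The following is a minimal sketch using the same three dates; the column names `Quarter`, `DayName`, and `IsWeekend` are assumptions chosen for this illustration.

```python
import pandas as pd

dates = pd.DataFrame({'Date': pd.to_datetime(['2023-05-15', '2018-10-20', '2021-07-30'])})

# Quarter of the year (1-4), useful for coarse seasonal grouping
dates['Quarter'] = dates['Date'].dt.quarter

# Weekday name, e.g. 'Monday'
dates['DayName'] = dates['Date'].dt.day_name()

# dayofweek is 0 for Monday ... 6 for Sunday, so 5 and 6 are the weekend
dates['IsWeekend'] = dates['Date'].dt.dayofweek >= 5

print(dates)
```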


### **3️⃣ Extracting Domain from Email Address**
If we have an email column in a user database, we can extract the **domain** (e.g., `gmail.com`, `yahoo.com`) for user segmentation.

#### **Python Example:**
```python
import pandas as pd

df = pd.DataFrame({'Email': ['john@gmail.com', 'alice@yahoo.com', 'mike@outlook.com']})

# Extracting the email domain
df['Domain'] = df['Email'].str.split('@').str[1]

print(df)
```
🔹 **Result:**
| Email             | Domain       |
|------------------|-------------|
| john@gmail.com   | gmail.com   |
| alice@yahoo.com  | yahoo.com   |
| mike@outlook.com | outlook.com |

✅ This can help analyze which email providers are most used by customers.
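
One way to get the provider breakdown is to count the extracted domains. This is a sketch only: the extra address `sara@Gmail.com` is invented to show why lowercasing the domain first can matter.

```python
import pandas as pd

# Hypothetical sample; the last address uses mixed case on purpose
users = pd.DataFrame({'Email': ['john@gmail.com', 'alice@yahoo.com',
                                'mike@outlook.com', 'sara@Gmail.com']})

# Extract the part after '@' and normalise the case before counting
users['Domain'] = users['Email'].str.split('@').str[1].str.lower()

# Number of users per email provider, most common first
domain_counts = users['Domain'].value_counts()
print(domain_counts)
```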


### **4️⃣ Splitting Address into Street, City, and State**
For a **Car Sales Website**, addresses may contain street, city, and state. Splitting this information makes it easier to filter cars based on location.

#### **Python Example:**
```python
import pandas as pd

df = pd.DataFrame({'Address': ['123 Main St, Los Angeles, CA', '456 Park Ave, New York, NY']})

# Splitting Address into Street, City, and State
df[['Street', 'City', 'State']] = df['Address'].str.split(', ', expand=True)

print(df)
```
🔹 **Result:**
| Address                         | Street       | City        | State |
|--------------------------------|-------------|------------|-------|
| 123 Main St, Los Angeles, CA  | 123 Main St | Los Angeles | CA    |
| 456 Park Ave, New York, NY    | 456 Park Ave | New York    | NY    |

✅ This makes filtering and searching by city/state easier.
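
Real address data is rarely formatted as consistently as the sample above. The sketch below assumes the same three-part layout but irregular spacing around the commas (the second address is altered for illustration); it splits on the bare comma, strips whitespace, and then filters by state.

```python
import pandas as pd

# Second address has inconsistent spacing on purpose
listings = pd.DataFrame({'Address': ['123 Main St, Los Angeles, CA',
                                     '456 Park Ave,New York,  NY']})

# Split on the comma alone, then trim whitespace from every piece
parts = listings['Address'].str.split(',', expand=True)
parts = parts.apply(lambda col: col.str.strip())
listings[['Street', 'City', 'State']] = parts

# Filtering by state now works regardless of the original spacing
ca_listings = listings[listings['State'] == 'CA']
print(ca_listings)
```

This still assumes exactly two commas per address; addresses with apartment numbers or missing parts would need a dedicated parser.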


## **Why is Feature Splitting Important?**
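1. **Improves Data Quality** – Each column holds one piece of information, so the data is more structured and easier to validate.
2. **Can Enhance Model Performance** – Models can learn from components (such as a month or a title) that are hidden inside the combined value.
3. **Facilitates Filtering & Searching** – Separate columns are easier to query, for example in database-backed web applications.
4. **Aids Data Visualization** – Split features can be grouped and plotted directly, for example sales by month.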

In [2]:
import pandas as pd
import numpy as np

## Feature Splitting

In [9]:
df = pd.read_csv('train.csv')

In [10]:
df.head()

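|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |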


In [11]:
df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [12]:
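# Names look like "Last, Title. First ...": take the part after ', ', then the part before the first '.'
df['Title'] = df['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]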
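
As an alternative to chaining two splits, the title can be pulled out with a single regular expression. This is a sketch, not the notebook's method: the pattern and the small `sample` frame (built from names shown in this notebook) are assumptions for illustration.

```python
import pandas as pd

# A few names copied from the dataset preview above
sample = pd.DataFrame({'Name': ['Braund, Mr. Owen Harris',
                                'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
                                'Heikkinen, Miss. Laina',
                                'Montvila, Rev. Juozas']})

# Capture everything between the comma (plus optional spaces) and the first period
sample['Title'] = sample['Name'].str.extract(r',\s*([^.]+)\.', expand=False)

print(sample[['Title', 'Name']])
```

With `expand=False` and a single capture group, `str.extract` returns a Series, so it can be assigned directly to a new column.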

In [22]:
df['Name'].str.split(', ', expand=True)

|     | 0 | 1 |
|-----|---|---|
| 0 | Braund | Mr. Owen Harris |
| 1 | Cumings | Mrs. John Bradley (Florence Briggs Thayer) |
| 2 | Heikkinen | Miss. Laina |
| 3 | Futrelle | Mrs. Jacques Heath (Lily May Peel) |
| 4 | Allen | Mr. William Henry |
| ... | ... | ... |
| 886 | Montvila | Rev. Juozas |
| 887 | Graham | Miss. Margaret Edith |
| 888 | Johnston | Miss. Catherine Helen "Carrie" |
| 889 | Behr | Mr. Karl Howell |


In [25]:
df['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)

|     | 0 | 1 | 2 |
|-----|---|---|---|
| 0 | Mr | Owen Harris |  |
| 1 | Mrs | John Bradley (Florence Briggs Thayer) |  |
| 2 | Miss | Laina |  |
| 3 | Mrs | Jacques Heath (Lily May Peel) |  |
| 4 | Mr | William Henry |  |
| ... | ... | ... | ... |
| 886 | Rev | Juozas |  |
| 887 | Miss | Margaret Edith |  |
| 888 | Miss | Catherine Helen "Carrie" |  |
| 889 | Mr | Karl Howell |  |


In [14]:
df[['Title','Name']]

|     | Title | Name |
|-----|-------|------|
| 0 | Mr | Braund, Mr. Owen Harris |
| 1 | Mrs | Cumings, Mrs. John Bradley (Florence Briggs Th... |
| 2 | Miss | Heikkinen, Miss. Laina |
| 3 | Mrs | Futrelle, Mrs. Jacques Heath (Lily May Peel) |
| 4 | Mr | Allen, Mr. William Henry |
| ... | ... | ... |
| 886 | Rev | Montvila, Rev. Juozas |
| 887 | Miss | Graham, Miss. Margaret Edith |
| 888 | Miss | Johnston, Miss. Catherine Helen "Carrie" |
| 889 | Mr | Behr, Mr. Karl Howell |


In [26]:
df['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]

0        Mr
1       Mrs
2      Miss
3       Mrs
4        Mr
       ... 
886     Rev
887    Miss
888    Miss
889      Mr
890      Mr
Name: 0, Length: 891, dtype: object

In [18]:
df.groupby('Title')[['Survived']].mean().sort_values(by='Survived', ascending=False)

| Title | Survived |
|-------|----------|
| the Countess | 1.0 |
| Mlle | 1.0 |
| Sir | 1.0 |
| Ms | 1.0 |
| Lady | 1.0 |
| Mme | 1.0 |
| Mrs | 0.792 |
| Miss | 0.697802 |
| Master | 0.575 |
| Col | 0.5 |
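
A mean of 1.0 for a title can come from a single passenger, so it may help to show the group size next to the survival rate. The sketch below uses a small made-up frame (`toy`), not the real `train.csv`, purely to illustrate the aggregation.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from the actual dataset
toy = pd.DataFrame({
    'Title':    ['Mr', 'Mr', 'Mr', 'Mrs', 'Mrs', 'Miss', 'Lady'],
    'Survived': [0,    0,    1,    1,     1,     1,      1],
})

# Survival rate together with the number of passengers behind it
summary = (toy.groupby('Title')['Survived']
              .agg(['mean', 'count'])
              .sort_values('mean', ascending=False))

print(summary)
```

On the real data the same call would be `df.groupby('Title')['Survived'].agg(['mean', 'count'])`; titles with very small counts are often merged into a single "rare" category before modelling.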


In [20]:
# Binary flag derived from the split-out title: 1 if the title is 'Mrs', else 0
df['Is_Married'] = 0
df.loc[df['Title'] == 'Mrs', 'Is_Married'] = 1

In [21]:
df['Is_Married']

0      0
1      1
2      0
3      1
4      0
      ..
886    0
887    0
888    0
889    0
890    0
Name: Is_Married, Length: 891, dtype: int64
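
The same flag can be built in one step by converting the boolean comparison to integers. This is an equivalent sketch on a small stand-in frame (`titles`), using the first five titles shown earlier.

```python
import pandas as pd

# Stand-in for the Title column (first five rows from the output above)
titles = pd.DataFrame({'Title': ['Mr', 'Mrs', 'Miss', 'Mrs', 'Mr']})

# The comparison yields True/False; astype(int) turns that into 1/0
titles['Is_Married'] = (titles['Title'] == 'Mrs').astype(int)

print(titles)
```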

# Thank You