## 🧱 Feature Construction

**Feature Construction** means creating new features that help our model learn better.

### ✅ Common Feature Construction techniques:

1. **Polynomial Features**
   ➤ Create new features by raising one or more numeric features to a power, or by multiplying them together.
   Example: `Age`, `Age²`, `Age * Fare`

2. **Binning**
   ➤ Convert a continuous feature into categories by splitting it into ranges.
   For example:

   * 0–12 years → “Child”
   * 13–59 years → “Adult”
   * 60+ years → “Senior”

3. **Interaction Features**
   ➤ Combine multiple features into a new one (for example, by multiplication).
   Example: `Income * Education_Level`

4. **Datetime Features**
   ➤ Extract new features from a date, such as year, month, and day.
   Example:

   ```python
   df['Year'] = pd.to_datetime(df['Date']).dt.year
   ```

5. **Features from Text**
   ➤ Extract numeric features from text data, for example with CountVectorizer or TF-IDF.
   Example:

   ```python
   from sklearn.feature_extraction.text import TfidfVectorizer
   ```

---

## ✂️ Feature Splitting

**Feature Splitting** means extracting multiple new features from a single feature.

### ✅ Examples:

1. **Full Name → First Name + Last Name**

   ```python
   df['First_Name'] = df['Full_Name'].str.split().str[0]
   df['Last_Name'] = df['Full_Name'].str.split().str[-1]  # last token, so middle names are skipped
   ```

2. **Date → Day, Month, Year**

   ```python
   dates = pd.to_datetime(df['Date'])  # parse once, reuse below
   df['Day'] = dates.dt.day
   df['Month'] = dates.dt.month
   df['Year'] = dates.dt.year
   ```

3. **Address → City, State, Zip Code**
4. **Name + Gender → Marital Status**
5. **Marks, Study Hours → IQ**
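The address example has no code, so here is a minimal sketch. It assumes every address follows a fixed `City, State Zip` layout; the `addr` DataFrame and its two rows are hypothetical, and real addresses are usually messier than this.

```python
import pandas as pd

# Hypothetical addresses in a fixed "City, State Zip" layout
addr = pd.DataFrame({"Address": ["Austin, TX 78701", "Salt Lake City, UT 84101"]})

# Split once on the comma: city on the left, "State Zip" on the right
parts = addr["Address"].str.split(",", n=1, expand=True)
addr["City"] = parts[0].str.strip()

# Split the remainder on whitespace into state and zip code
state_zip = parts[1].str.strip().str.split(expand=True)
addr["State"] = state_zip[0]
addr["Zip_Code"] = state_zip[1]

print(addr)
```

Passing `expand=True` returns the pieces as separate columns, which avoids repeated `.str[i]` indexing.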

---

🔍 **In short:**

* **Feature Construction** = creating new features
* **Feature Splitting** = breaking one feature into several smaller features



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df=pd.read_csv('/content/Titanic-Dataset.csv')
df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [None]:
df=df[['Name','Age','Sex','SibSp','Parch','Pclass','Survived']]  # select columns by name rather than position
df.head(3)

Unnamed: 0,Name,Age,Sex,SibSp,Parch,Pclass,Survived
0,"Braund, Mr. Owen Harris",22.0,male,1,0,3,0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,female,1,0,1,1
2,"Heikkinen, Miss. Laina",26.0,female,0,0,3,1


## Train Model Without Feature Construction or Splitting

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(df.drop(['Name','Survived'],axis=1),df['Survived'],test_size=0.2,random_state=2)
X_train.head(2)

Unnamed: 0,Age,Sex,SibSp,Parch,Pclass
30,40.0,male,0,0,1
10,4.0,female,1,1,3


In [None]:
from sklearn.preprocessing import OneHotEncoder
ohe=OneHotEncoder(sparse_output=False,dtype=np.int32)
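X_train_age=ohe.fit_transform(X_train[['Sex']])  # holds the encoded Sex column, despite the "_age" name

# Caution: OneHotEncoder sorts categories alphabetically (female, male), so the
# first encoded column flags female passengers. With the hand-written names below,
# 'sex_male' is 1 for female rows and 'sex_female' is 1 for male rows.
# ohe.get_feature_names_out() would return the names in the matching order.
X_train_age=pd.DataFrame(X_train_age,columns=['sex_male','sex_female'])

X_test_age=pd.DataFrame(ohe.transform(X_test[['Sex']]),columns=(['sex_male','sex_female']))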
X_test_age,X_train_age

(     sex_male  sex_female
 0           0           1
 1           0           1
 2           1           0
 3           0           1
 4           1           0
 ..        ...         ...
 174         0           1
 175         0           1
 176         0           1
 177         0           1
 178         1           0
 
 [179 rows x 2 columns],
      sex_male  sex_female
 0           0           1
 1           1           0
 2           0           1
 3           0           1
 4           0           1
 ..        ...         ...
 707         1           0
 708         0           1
 709         0           1
 710         0           1
 711         0           1
 
 [712 rows x 2 columns])

In [None]:
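# Note: reset_index() without drop=True keeps the old row labels as 'index' columns,
# so the model below also trains on them; reset_index(drop=True) would avoid that.
new_X_train=pd.concat([X_train.reset_index(),X_train_age.reset_index()],axis=1).drop('Sex',axis=1)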
new_X_train

Unnamed: 0,index,Age,SibSp,Parch,Pclass,index.1,sex_male,sex_female
0,30,40.0,0,0,1,0,0,1
1,10,4.0,1,1,3,1,1,0
2,873,47.0,0,0,3,2,0,1
3,182,9.0,4,2,3,3,0,1
4,876,20.0,0,0,3,4,0,1
...,...,...,...,...,...,...,...,...
707,534,30.0,0,0,3,707,1,0
708,584,,0,0,3,708,0,1
709,493,71.0,0,0,1,709,0,1
710,527,,0,0,1,710,0,1
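The table above contains `index` and `index.1` columns because `reset_index()` turns the old row labels into regular columns. A minimal sketch of the same concatenation without those columns, using small hypothetical stand-ins (`X_part`, `sex_part`) for the real frames:

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for X_train and its encoded Sex columns
X_part = pd.DataFrame(
    {"Age": [40.0, 4.0], "Sex": ["male", "female"]},
    index=[30, 10],  # shuffled row labels, as left behind by train_test_split
)
sex_part = pd.DataFrame({"Sex_female": [0, 1], "Sex_male": [1, 0]})

# drop=True discards the old index instead of adding it as an 'index' column
combined = pd.concat(
    [X_part.reset_index(drop=True), sex_part.reset_index(drop=True)],
    axis=1,
).drop(columns="Sex")

print(combined)
```

Aligning both frames on a fresh 0..n-1 index is still needed, since `pd.concat(axis=1)` matches rows by index label.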


In [None]:
new_df=np.hstack([X_test,X_test_age])
new_df
# np.hstack returns a NumPy array (dtype=object), so the column names are lost

array([[42.0, 'male', 0, ..., 1, 0, 1],
       [21.0, 'male', 0, ..., 3, 0, 1],
       [24.0, 'female', 1, ..., 2, 1, 0],
       ...,
       [nan, 'male', 8, ..., 3, 0, 1],
       [26.0, 'male', 0, ..., 3, 0, 1],
       [29.0, 'female', 1, ..., 3, 1, 0]], dtype=object)

In [None]:
new_X_test=pd.concat([X_test.reset_index(),X_test_age.reset_index()],axis=1).drop('Sex',axis=1)
new_X_test

Unnamed: 0,index,Age,SibSp,Parch,Pclass,index.1,sex_male,sex_female
0,707,42.0,0,0,1,0,0,1
1,37,21.0,0,0,3,1,0,1
2,615,24.0,1,2,2,2,1,0
3,169,28.0,0,0,3,3,0,1
4,68,17.0,4,2,3,4,1,0
...,...,...,...,...,...,...,...,...
174,89,24.0,0,0,3,174,0,1
175,80,22.0,0,0,3,175,0,1
176,846,,8,2,3,176,0,1
177,870,26.0,0,0,3,177,0,1


In [None]:
from sklearn.tree import DecisionTreeClassifier
model=DecisionTreeClassifier()
model.fit(new_X_train,y_train)

In [None]:
y_pred=model.predict(new_X_test)
y_pred

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0,
       0, 0, 0])

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)  # signature is (y_true, y_pred); accuracy is the same either way

0.7206703910614525

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(model,new_X_train,y_train,cv=5,scoring='accuracy').mean()

np.float64(0.6926425686989066)

## With Feature Construction

In [None]:
new_X_test

Unnamed: 0,index,Age,SibSp,Parch,Pclass,index.1,sex_male,sex_female
0,707,42.0,0,0,1,0,0,1
1,37,21.0,0,0,3,1,0,1
2,615,24.0,1,2,2,2,1,0
3,169,28.0,0,0,3,3,0,1
4,68,17.0,4,2,3,4,1,0
...,...,...,...,...,...,...,...,...
174,89,24.0,0,0,3,174,0,1
175,80,22.0,0,0,3,175,0,1
176,846,,8,2,3,176,0,1
177,870,26.0,0,0,3,177,0,1


In [None]:
new_X_train['family']=new_X_train['SibSp']+new_X_train['Parch']
new_X_test['family']=new_X_test['SibSp']+new_X_test['Parch']

In [None]:
new_X_train

Unnamed: 0,index,Age,SibSp,Parch,Pclass,index.1,sex_male,sex_female,family
0,30,40.0,0,0,1,0,0,1,0
1,10,4.0,1,1,3,1,1,0,2
2,873,47.0,0,0,3,2,0,1,0
3,182,9.0,4,2,3,3,0,1,6
4,876,20.0,0,0,3,4,0,1,0
...,...,...,...,...,...,...,...,...,...
707,534,30.0,0,0,3,707,1,0,0
708,584,,0,0,3,708,0,1,0
709,493,71.0,0,0,1,709,0,1,0
710,527,,0,0,1,710,0,1,0


In [None]:
new_X_train.drop(['SibSp','Parch'],axis=1,inplace=True)
new_X_test.drop(['SibSp','Parch'],axis=1,inplace=True)

In [None]:
new_X_test,new_X_train

(     index   Age  Pclass  index  sex_male  sex_female  family
 0      707  42.0       1      0         0           1       0
 1       37  21.0       3      1         0           1       0
 2      615  24.0       2      2         1           0       3
 3      169  28.0       3      3         0           1       0
 4       68  17.0       3      4         1           0       6
 ..     ...   ...     ...    ...       ...         ...     ...
 174     89  24.0       3    174         0           1       0
 175     80  22.0       3    175         0           1       0
 176    846   NaN       3    176         0           1      10
 177    870  26.0       3    177         0           1       0
 178    251  29.0       3    178         1           0       2
 
 [179 rows x 7 columns],
      index   Age  Pclass  index  sex_male  sex_female  family
 0       30  40.0       1      0         0           1       0
 1       10   4.0       3      1         1           0       2
 2      873  47.0       3   

In [None]:
model2=DecisionTreeClassifier()
model2.fit(new_X_train,y_train)

In [None]:
y_pred2=model2.predict(new_X_test)
y_pred2

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0,
       0, 1, 0])

In [None]:
accuracy_score(y_test,y_pred2)  # signature is (y_true, y_pred); accuracy is the same either way

0.7206703910614525

In [None]:
# Uses cv=10, whereas the baseline above used cv=5, so the two means are not strictly comparable
cross_val_score(model2,new_X_train,y_train,cv=10,scoring='accuracy').mean()

np.float64(0.6800078247261346)
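To compare a baseline feature set with a constructed one, it helps to score both on identical folds with a fixed tree seed. The sketch below shows one way to set that up. It runs on randomly generated stand-in data (not the Titanic data), so the scores it prints carry no meaning; `base`, `constructed`, and the seeds are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (random, for illustration only; not the Titanic data)
rng = np.random.default_rng(0)
n = 200
base = pd.DataFrame({
    "SibSp": rng.integers(0, 4, n),
    "Parch": rng.integers(0, 3, n),
    "Pclass": rng.integers(1, 4, n),
})
y = rng.integers(0, 2, n)

# Constructed feature set: replace SibSp and Parch with their sum
constructed = base.drop(columns=["SibSp", "Parch"]).assign(
    family=base["SibSp"] + base["Parch"]
)

# Same folds and same tree seed for both feature sets
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
tree = DecisionTreeClassifier(random_state=0)
score_base = cross_val_score(tree, base, y, cv=cv, scoring="accuracy").mean()
score_constructed = cross_val_score(tree, constructed, y, cv=cv, scoring="accuracy").mean()

print(score_base, score_constructed)
```

Fixing `random_state` on both the splitter and the tree removes run-to-run noise, so any difference between the two scores comes from the features.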

## More Feature Extraction

In [None]:
df.head()

Unnamed: 0,Name,Age,Sex,SibSp,Parch,Pclass,Survived
0,"Braund, Mr. Owen Harris",22.0,male,1,0,3,0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,female,1,0,1,1
2,"Heikkinen, Miss. Laina",26.0,female,0,0,3,1
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,female,1,0,1,1
4,"Allen, Mr. William Henry",35.0,male,0,0,3,0


In [None]:
# The text before the comma is the passenger's surname (stored here as 'nickname')
df['nickname']=df['Name'].str.split(',').str[0]
df.head()

Unnamed: 0,Name,Age,Sex,SibSp,Parch,Pclass,Survived,nickname
0,"Braund, Mr. Owen Harris",22.0,male,1,0,3,0,Braund
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,female,1,0,1,1,Cumings
2,"Heikkinen, Miss. Laina",26.0,female,0,0,3,1,Heikkinen
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,female,1,0,1,1,Futrelle
4,"Allen, Mr. William Henry",35.0,male,0,0,3,0,Allen


In [None]:
# Take the text between the comma and the first period: the title (Mr, Mrs, Miss, ...)
df['name_extension']=df['Name'].str.split(',').str[1].str.split('.').str[0].str.strip()
df.head()

Unnamed: 0,Name,Age,Sex,SibSp,Parch,Pclass,Survived,nickname,name_extension
0,"Braund, Mr. Owen Harris",22.0,male,1,0,3,0,Braund,Mr
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,female,1,0,1,1,Cumings,Mrs
2,"Heikkinen, Miss. Laina",26.0,female,0,0,3,1,Heikkinen,Miss
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,female,1,0,1,1,Futrelle,Mrs
4,"Allen, Mr. William Henry",35.0,male,0,0,3,0,Allen,Mr
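The chained `split` calls can also be written as a single regular-expression capture. This is an alternative sketch, not a replacement for the notebook's approach; the first two names come from the table above, and the third is a made-up name included only to show a multi-word title.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Doe, the Countess. Jane",  # hypothetical name with a multi-word title
])

# Capture everything between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles.tolist())
```

With `expand=False` and a single capture group, `str.extract` returns a Series rather than a one-column DataFrame.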


In [None]:
df['name_extension'].unique()
# This could be an important feature (categorical column)

array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
       'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess',
       'Jonkheer'], dtype=object)

In [None]:
df.groupby('name_extension').get_group('Miss')
# Unmarried females

Unnamed: 0,Name,Age,Sex,SibSp,Parch,Pclass,Survived,nickname,name_extension
2,"Heikkinen, Miss. Laina",26.0,female,0,0,3,1,Heikkinen,Miss
10,"Sandstrom, Miss. Marguerite Rut",4.0,female,1,1,3,1,Sandstrom,Miss
11,"Bonnell, Miss. Elizabeth",58.0,female,0,0,1,1,Bonnell,Miss
14,"Vestrom, Miss. Hulda Amanda Adolfina",14.0,female,0,0,3,0,Vestrom,Miss
22,"McGowan, Miss. Anna ""Annie""",15.0,female,0,0,3,1,McGowan,Miss
...,...,...,...,...,...,...,...,...,...
866,"Duran y More, Miss. Asuncion",27.0,female,1,0,2,1,Duran y More,Miss
875,"Najib, Miss. Adele Kiamie ""Jane""",15.0,female,0,0,3,1,Najib,Miss
882,"Dahlberg, Miss. Gerda Ulrika",22.0,female,0,0,3,0,Dahlberg,Miss
887,"Graham, Miss. Margaret Edith",19.0,female,0,0,1,1,Graham,Miss
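One quick way to check whether the title carries signal is to look at the survival rate per title. The sketch below uses only six rows copied from the outputs above, so the numbers illustrate the mechanics and say nothing about the full dataset.

```python
import pandas as pd

# Six rows taken from the tables above (title and Survived only)
sample = pd.DataFrame({
    "name_extension": ["Mr", "Mrs", "Miss", "Mrs", "Mr", "Miss"],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Row count and mean survival per title
summary = sample.groupby("name_extension")["Survived"].agg(["count", "mean"])

print(summary)
```

On the real `df`, the same `groupby` call would apply unchanged; titles with very few rows would give unstable rates.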
