In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

In [None]:
**Feature construction** and **feature splitting** are essential techniques in data preprocessing and feature engineering. They are used to create new features from existing data, enhance the predictive power of models, and make data more interpretable.

### Feature Construction

Feature construction involves creating new features from the existing ones to better represent the underlying problem to the machine learning algorithms. This process can help improve model performance by making the patterns in the data more evident.

#### Examples of Feature Construction

1. **Mathematical Transformations**:
    - Creating interaction terms (e.g., product of two features).
    - Applying mathematical functions (e.g., log, square root).

    ```python
    import pandas as pd
    import numpy as np

    # Sample data
    df = pd.DataFrame({
        'height': [1.5, 1.8, 1.6, 1.7],
        'weight': [60, 75, 65, 70]
    })

    # Creating a new feature - Body Mass Index (BMI)
    df['bmi'] = df['weight'] / (df['height'] ** 2)
    print(df)
    ```

2. **Aggregations**:
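    - Combining columns row by row (e.g., a sum or mean across related columns), or summarizing groups of rows with functions such as mean or sum.

    ```python
    # Row-wise sum of two columns; illustrative only, since height and weight have different units
    df['height_weight_sum'] = df['height'] + df['weight']
    print(df)
    ```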

3. **Date Features**:
    - Extracting features like day, month, year, day of the week, etc., from a date column.

    ```python
    df['date'] = pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04'])
    df['day_of_week'] = df['date'].dt.dayofweek
    print(df)
    ```
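
The transformations listed above (interaction terms, mathematical functions, group-level aggregations) can be sketched together as follows. The `sales` table and its column names are hypothetical and chosen only for illustration; this is a minimal sketch, not a recommended feature set.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only
sales = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B'],
    'price': [10.0, 20.0, 5.0, 15.0],
    'quantity': [3, 1, 4, 2],
})

# Interaction term: product of two features
sales['revenue'] = sales['price'] * sales['quantity']

# Mathematical function: log(1 + x) compresses large values and is defined at 0
sales['log_revenue'] = np.log1p(sales['revenue'])

# Group-wise aggregation, broadcast back to every row of the group
sales['store_mean_price'] = sales.groupby('store')['price'].transform('mean')

print(sales)
```

Using `transform('mean')` rather than `agg` keeps the result aligned with the original rows, so the group statistic can be used directly as a per-row feature.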

### Feature Splitting

Feature splitting involves breaking down a single feature into multiple components. This is often useful when dealing with categorical variables or text data.

#### Examples of Feature Splitting

1. **Splitting Categorical Features**:
    - Splitting a categorical feature into dummy/indicator variables (one-hot encoding).

    ```python
    df = pd.DataFrame({
        'color': ['red', 'blue', 'green', 'blue']
    })

    # One-hot encoding
    df_encoded = pd.get_dummies(df['color'], prefix='color')
    print(df_encoded)
    ```

2. **Splitting Text Features**:
    - Splitting a text feature into multiple components, such as splitting a full name into first and last names.

    ```python
    df = pd.DataFrame({
        'full_name': ['John Doe', 'Jane Smith', 'Alice Johnson']
    })

    # Splitting full_name into first_name and last_name
    # n=1 splits at the first space only, so names with extra tokens still yield two columns
    df[['first_name', 'last_name']] = df['full_name'].str.split(' ', n=1, expand=True)
    print(df)
    ```
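
Real name columns often contain middle names, in which case an unbounded split returns more than two columns. The sketch below, on hypothetical names, shows how the `n` argument of `str.split` and `str.rsplit` can control where the split happens; the column name `rest_of_name` is an assumption made for this example.

```python
import pandas as pd

# Hypothetical names, including one with a middle name
people = pd.DataFrame({
    'full_name': ['John Doe', 'Mary Ann Smith', 'Alice Johnson']
})

# n=1 splits only at the first space, so exactly two columns come back
parts = people['full_name'].str.split(' ', n=1, expand=True)
people['first_name'] = parts[0]
people['rest_of_name'] = parts[1]

# rsplit with n=1 splits at the last space instead, isolating the final token
people['last_name'] = people['full_name'].str.rsplit(' ', n=1).str[-1]

print(people)
```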

### Summary

- **Feature Construction**: Creating new features to enhance model performance by making patterns more evident.
- **Feature Splitting**: Breaking down a single feature into multiple components for better representation.

Both techniques are crucial for effective feature engineering, improving the quality and performance of machine learning models.

### Example Code in a Jupyter Notebook

Here's a full example demonstrating both feature construction and feature splitting:

```python
import pandas as pd
import numpy as np

# Sample data
df = pd.DataFrame({
    'height': [1.5, 1.8, 1.6, 1.7],
    'weight': [60, 75, 65, 70],
    'full_name': ['John Doe', 'Jane Smith', 'Alice Johnson', 'Bob Brown'],
    'date': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04']),
    'color': ['red', 'blue', 'green', 'blue']
})

# Feature Construction
df['bmi'] = df['weight'] / (df['height'] ** 2)
df['height_weight_sum'] = df['height'] + df['weight']
df['day_of_week'] = df['date'].dt.dayofweek

# Feature Splitting
# n=1 splits at the first space only, so the result always has two columns
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', n=1, expand=True)
df_encoded = pd.get_dummies(df['color'], prefix='color')

# Concatenate the original dataframe with the one-hot encoded columns
df = pd.concat([df, df_encoded], axis=1)

print(df)
```

This code demonstrates both feature construction (BMI calculation, height and weight sum, day of the week extraction) and feature splitting (splitting full name and one-hot encoding of color).

In [2]:
df = pd.read_csv('train.csv')[['Age','Pclass','SibSp','Parch','Survived']]


In [3]:
df.head()


Unnamed: 0,Age,Pclass,SibSp,Parch,Survived
0,22.0,3,1,0,0
1,38.0,1,1,0,1
2,26.0,3,0,0,1
3,35.0,1,1,0,1
4,35.0,3,0,0,0


In [4]:
df.dropna(inplace=True)


In [5]:
df.head()


Unnamed: 0,Age,Pclass,SibSp,Parch,Survived
0,22.0,3,1,0,0
1,38.0,1,1,0,1
2,26.0,3,0,0,1
3,35.0,1,1,0,1
4,35.0,3,0,0,0


In [6]:
# Copy the feature columns so that adding new columns to X later does not touch df
X = df.iloc[:,0:4].copy()
y = df.iloc[:,-1]

In [7]:
X.head()

Unnamed: 0,Age,Pclass,SibSp,Parch
0,22.0,3,1,0
1,38.0,1,1,0
2,26.0,3,0,0
3,35.0,1,1,0
4,35.0,3,0,0


In [8]:
np.mean(cross_val_score(LogisticRegression(),X,y,scoring='accuracy',cv=20))


0.6933333333333332

In [18]:
# Applying feature construction: family size = siblings/spouses + parents/children + the passenger
X['Family_size'] = X['SibSp'] + X['Parch'] + 1

In [21]:
X.head()

Unnamed: 0,Age,Pclass,SibSp,Parch,Family_size
0,22.0,3,1,0,2
1,38.0,1,1,0,2
2,26.0,3,0,0,1
3,35.0,1,1,0,2
4,35.0,3,0,0,1


In [22]:
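def myfunc(num):
    """Map a family size to a category: 0 = alone, 1 = small family (2 to 4), 2 = large family (everything else)."""
    if num == 1:
        # alone
        return 0
    elif 1 < num <= 4:
        # small family
        return 1
    else:
        # large family
        return 2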

In [23]:
myfunc(4)


1

In [24]:
X['Family_type'] = X['Family_size'].apply(myfunc)

In [25]:
X.head()

Unnamed: 0,Age,Pclass,SibSp,Parch,Family_size,Family_type
0,22.0,3,1,0,2,1
1,38.0,1,1,0,2,1
2,26.0,3,0,0,1,0
3,35.0,1,1,0,2,1
4,35.0,3,0,0,1,0


In [26]:
X.drop(columns=['SibSp','Parch','Family_size'],inplace=True)

In [27]:
X.head()

Unnamed: 0,Age,Pclass,Family_type
0,22.0,3,1
1,38.0,1,1
2,26.0,3,0
3,35.0,1,1
4,35.0,3,0


In [28]:
np.mean(cross_val_score(LogisticRegression(),X,y,scoring='accuracy',cv=20))

0.7003174603174602
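
The two mean accuracies differ by less than one percentage point, so it can help to compare the feature sets fold by fold on identical splits. The sketch below uses synthetic data as a stand-in for `train.csv` (so its scores say nothing about the Titanic data); `StratifiedKFold` with a fixed `random_state` and `max_iter=1000` are choices made for this example, not settings from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the Titanic columns (NOT the real data)
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    'Age': rng.uniform(1, 80, n),
    'Pclass': rng.integers(1, 4, n),
    'SibSp': rng.integers(0, 4, n),
    'Parch': rng.integers(0, 4, n),
})
demo['Family_size'] = demo['SibSp'] + demo['Parch'] + 1
target = pd.Series(rng.integers(0, 2, n), name='Survived')

# Reusing one splitter means both feature sets are scored on identical folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

base_cols = ['Age', 'Pclass', 'SibSp', 'Parch']
new_cols = ['Age', 'Pclass', 'Family_size']
base_scores = cross_val_score(model, demo[base_cols], target, scoring='accuracy', cv=cv)
new_scores = cross_val_score(model, demo[new_cols], target, scoring='accuracy', cv=cv)

# Per-fold difference: a mean gap smaller than the fold-to-fold spread is weak evidence
diff = new_scores - base_scores
print(f"baseline {base_scores.mean():.3f}, new {new_scores.mean():.3f}, "
      f"diff {diff.mean():.3f} +/- {diff.std():.3f}")
```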

In [29]:
# Feature splitting: reload the full dataset, including the Name column
df = pd.read_csv('train.csv')

In [30]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [31]:
df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [32]:
df['Title'] = df['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]

In [33]:
df['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]

0        Mr
1       Mrs
2      Miss
3       Mrs
4        Mr
       ... 
886     Rev
887    Miss
888    Miss
889      Mr
890      Mr
Name: 0, Length: 891, dtype: object
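
An alternative to the two chained splits is a single regular expression. The sketch below applies `str.extract` to three names taken from the output above; the pattern assumes every name follows the "Last, Title. First" layout, which is not verified here for the whole dataset.

```python
import pandas as pd

# A few names in the same "Last, Title. First" layout as the Name column
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Montvila, Rev. Juozas',
])

# Capture everything between the comma and the first period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

print(titles.tolist())
```

With one capture group and `expand=False`, `str.extract` returns a Series, and names that do not match the pattern become NaN instead of raising an error.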

In [37]:
df[['Title','Name']]


Unnamed: 0,Title,Name
0,Mr,"Braund, Mr. Owen Harris"
1,Mrs,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,Miss,"Heikkinen, Miss. Laina"
3,Mrs,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,Mr,"Allen, Mr. William Henry"
...,...,...
886,Rev,"Montvila, Rev. Juozas"
887,Miss,"Graham, Miss. Margaret Edith"
888,Miss,"Johnston, Miss. Catherine Helen ""Carrie"""
889,Mr,"Behr, Mr. Karl Howell"
890,Mr,"Dooley, Mr. Patrick"


In [41]:
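# Survival rate per title, highest first ('Survived' is selected before averaging so text columns are skipped)
# df.groupby('Title')['Survived'].mean().sort_values(ascending=False)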

In [39]:
# Assign through a single .loc call on the DataFrame to avoid chained indexing
df['Is_Married'] = 0
df.loc[df['Title'] == 'Mrs', 'Is_Married'] = 1


In [40]:
df['Is_Married']


0      0
1      1
2      0
3      1
4      0
      ..
886    0
887    0
888    0
889    0
890    0
Name: Is_Married, Length: 891, dtype: int64
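
Chained indexing of the form `df['col'].loc[mask] = value` may write to a temporary copy, which is what pandas' `SettingWithCopyWarning` is about. The sketch below shows two single-step ways to build the same flag on a small hypothetical frame; the column name `Is_Married_alt` is used only to compare the two results.

```python
import pandas as pd

# Titles like those produced by splitting the Name column (small hypothetical sample)
demo = pd.DataFrame({'Title': ['Mr', 'Mrs', 'Miss', 'Mrs']})

# Option 1: one .loc call on the DataFrame itself (row mask, column label)
demo['Is_Married'] = 0
demo.loc[demo['Title'] == 'Mrs', 'Is_Married'] = 1

# Option 2: build the column directly from the boolean comparison
demo['Is_Married_alt'] = (demo['Title'] == 'Mrs').astype(int)

print(demo)
```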