## Feature Construction

**Feature construction relies entirely on your domain experience; there is no mathematical recipe for it.**

In [1]:
import pandas as pd
import numpy as np


In [11]:
df = pd.read_csv('https://raw.githubusercontent.com/campusx-official/100-days-of-machine-learning/refs/heads/main/day45-feature-construction-and-feature-splitting/train.csv')

In [12]:
df = df[['Age', 'Pclass', 'SibSp', 'Parch', 'Survived']]
df.head()

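|   | Age  | Pclass | SibSp | Parch | Survived |
|---|------|--------|-------|-------|----------|
| 0 | 22.0 | 3      | 1     | 0     | 0        |
| 1 | 38.0 | 1      | 1     | 0     | 1        |
| 2 | 26.0 | 3      | 0     | 0     | 1        |
| 3 | 35.0 | 1      | 1     | 0     | 1        |
| 4 | 35.0 | 3      | 0     | 0     | 0        |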


In [13]:
df = df.dropna()
X = df.drop(columns=['Survived'])
y = df['Survived']

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
np.mean(cross_val_score(LogisticRegression(), X, y, cv=20, scoring='accuracy'))

np.float64(0.6933333333333332)

### Applying Feature Construction 

In [21]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Create family_members column
df['family_members'] = df['SibSp'] + df['Parch'] + 1

# Handle potential null values (fill with 1 assuming person is alone if data missing)
df['family_members'] = df['family_members'].fillna(1).astype(int)  # Convert to int after fill

def family_size_count(x):
    try:
        x = int(x)
        if x == 1:
            return 'Alone'
        elif 2 <= x <= 4:
            return 'Small'
        else:
            return 'Large'
    except (ValueError, TypeError):
        return 'Alone'

# Apply the function
df['family_size'] = df['family_members'].apply(family_size_count)

# Convert categorical variables to dummy/indicator variables
X_new = pd.get_dummies(df.drop(columns=['Survived']), drop_first=True)
y_new = df['Survived']

# Calculate cross-validation score
logreg = LogisticRegression(max_iter=1000)  # Increased max_iter for convergence
cv_scores = cross_val_score(logreg, X_new, y_new, cv=20, scoring='accuracy')
print(f"Mean Accuracy: {np.mean(cv_scores):.4f}")
print(f"Standard Deviation: {np.std(cv_scores):.4f}")

Mean Accuracy: 0.7326
Standard Deviation: 0.0863
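
The same binning can also be written without a Python-level function. The following is a minimal sketch, assuming a small hypothetical frame `toy` in place of the Titanic data; it uses `pd.cut` with the same boundaries as `family_size_count`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns used above
toy = pd.DataFrame({'SibSp': [0, 1, 1, 3, 5], 'Parch': [0, 0, 2, 2, 2]})
toy['family_members'] = toy['SibSp'] + toy['Parch'] + 1

# Right-closed bins: (0, 1] -> Alone, (1, 4] -> Small, (4, inf) -> Large
toy['family_size'] = pd.cut(
    toy['family_members'],
    bins=[0, 1, 4, np.inf],
    labels=['Alone', 'Small', 'Large'],
)
print(toy)
```

`pd.cut` returns a categorical column, which `pd.get_dummies` can encode in the same way as the string labels.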


## Curse of Dimensionality in Machine Learning

### Definition
The "curse of dimensionality" refers to various challenges that arise when working with high-dimensional data (many features/variables), where the performance of machine learning algorithms often degrades as the number of dimensions increases.

### Key Problems

#### 1. Data Sparsity
- In high dimensions, data points become increasingly isolated
- Volume of space grows exponentially with dimensions, making data sparse
- Requires exponentially more data to maintain density
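
One way to make the sparsity point concrete is to ask how long the edge of a sub-cube must be to cover a fixed fraction of the unit hypercube. The sketch below assumes uniformly distributed data and an arbitrary target fraction of 10%; both are illustrative choices.

```python
# Edge length of a sub-cube that covers a fixed fraction of the unit hypercube.
# In d dimensions the sub-cube volume is edge**d, so edge = fraction**(1/d).
fraction = 0.10
edge = {d: fraction ** (1 / d) for d in (1, 2, 10, 100)}

for d, e in edge.items():
    print(f"d={d:>3}: edge length = {e:.3f}")
```

In 1D, an interval of length 0.1 captures 10% of the data, but in 10 dimensions the sub-cube must span about 0.79 of every axis, so a "local" neighbourhood is no longer local.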

#### 2. Distance Measures Become Meaningless
- Pairwise distances concentrate: the nearest and farthest points from a query end up at almost the same distance
- The distinction between "near" and "far" points diminishes
- Affects distance-based algorithms (k-NN, clustering)
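
The effect can be observed with a small simulation. This is a rough sketch, assuming points drawn uniformly from the unit hypercube and Euclidean distance; `relative_contrast` is a helper name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n_points=500):
    """(farthest - nearest) / nearest distance from one query to random points."""
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrast = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, c in contrast.items():
    print(f"dim={dim:>4}: relative contrast = {c:.2f}")
```

A relative contrast close to zero means the nearest neighbour is barely closer than the farthest point, which is why k-NN and distance-based clustering tend to lose discriminative power as dimensions grow.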

#### 3. Overfitting Risk Increases
- Model complexity grows with dimensionality
- More parameters needed → higher risk of overfitting
- Need for more training data grows exponentially
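
A quick way to see the overfitting risk is to fit a linear model on pure noise with more features than training rows. The sketch below assumes Gaussian noise features and random ±1 labels, and uses plain least squares rather than any particular classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_features = 50, 50, 200  # more features than training rows

# Features and labels are independent noise: there is nothing real to learn
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
y_train = rng.choice([-1.0, 1.0], size=n_train)
y_test = rng.choice([-1.0, 1.0], size=n_test)

# Minimum-norm least-squares fit
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = np.mean((X_train @ w - y_train) ** 2)
test_mse = np.mean((X_test @ w - y_test) ** 2)
print(f"train MSE = {train_mse:.2e}, test MSE = {test_mse:.2f}")
```

The training error is essentially zero even though the labels are random, while the test error stays large: with enough dimensions the model can memorize anything.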

#### 4. Computational Challenges
- Higher memory and processing requirements
- Algorithms become slower (often non-linearly)

### Common Solutions

#### Dimensionality Reduction
- **Feature Selection**: Choose most relevant features
- **Feature Extraction**: PCA, t-SNE, UMAP, autoencoders

#### Regularization
- L1/L2 regularization to prevent overfitting
- Dropout in neural networks
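
The difference between the two penalties shows up clearly on synthetic data. This is a minimal sketch, assuming scikit-learn's `Lasso` and `Ridge` on a made-up regression problem where only two of twenty features matter; the `alpha` values are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_samples, n_features = 200, 20

# Only the first two features carry signal; the other 18 are noise
X = rng.normal(size=(n_samples, n_features))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n_samples)

lasso = Lasso(alpha=0.2).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty

print("non-zero Lasso coefficients:", np.count_nonzero(lasso.coef_))
print("non-zero Ridge coefficients:", np.count_nonzero(ridge.coef_))
```

L1 tends to drive the coefficients of irrelevant features exactly to zero, so it doubles as a form of feature selection, whereas L2 only shrinks them.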

#### Alternative Algorithms
- Tree-based methods often cope better with many features, since each split uses only one feature at a time
- Manifold learning or algorithms designed specifically for high-dimensional data

### Visualization Challenge
- Human intuition fails beyond 3D
- Projection techniques needed to visualize high-D data

### Notable Quotes
> "In high dimensions, all data is sparse." - common paraphrase (Richard Bellman coined the term "curse of dimensionality")
>
> "The curse of dimensionality is the bane of machine learning." - common saying

## <span style="color: #00ddffff;"><i>PCA</i></span>

- PCA is a technique that converts higher-dimensional data into lower-dimensional data while preserving as much of the information (variance) in the data as possible.

- It can reduce multidimensional data to 2D or 3D so that it can be visualized.
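
As a rough illustration, the sketch below applies scikit-learn's `PCA` to hypothetical data with 10 observed features that are driven by only 2 hidden factors. It keeps two components for plotting; `n_components=3` would give the 3D case.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 10 observed features driven by only 2 hidden factors
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)  # shape (300, 2), ready for a scatter plot

print(X_2d.shape)
print("variance retained:", pca.explained_variance_ratio_.sum())
```

`explained_variance_ratio_` reports how much of the original variance each kept component carries, which is a practical check on how much information the reduction loses.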

**Cases where you have to pick <mark>one feature out of two</mark>**
* Plot the two features against each other, check which one has the wider spread along its axis, and keep that feature.
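
The spread comparison can also be done numerically using the variance of each feature. The sketch below assumes two hypothetical features, `x1` and `x2`, measured on the same scale; with different units the variances would not be directly comparable.

```python
import pandas as pd

# Hypothetical two-feature dataset on a common scale
data = pd.DataFrame({
    'x1': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'x2': [3.2, 3.6, 3.4, 3.8, 3.5, 3.7],
})

spread = data.var()     # sample variance along each feature's axis
keep = spread.idxmax()  # feature with the wider spread

print(spread)
print("feature to keep:", keep)
```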

**<i>For cases where both features have an equal spread, we use the concept of <mark>feature extraction</mark></i>**
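
To illustrate feature extraction in this situation, the sketch below builds two hypothetical correlated features with roughly the same spread and combines them into a single new feature along the first principal component. It does PCA by hand with NumPy, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical features with the same spread that are strongly correlated
base = rng.normal(size=500)
x1 = base + 0.3 * rng.normal(size=500)
x2 = base + 0.3 * rng.normal(size=500)
X = np.column_stack([x1, x2])

# PCA by hand: eigen-decomposition of the covariance matrix
X_centered = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X_centered, rowvar=False))
first_pc = eigvecs[:, -1]            # eigh sorts eigenvalues in ascending order
new_feature = X_centered @ first_pc  # single extracted feature

print("variance of x1, x2:", X.var(axis=0, ddof=1))
print("variance of new feature:", new_feature.var(ddof=1))
print("share of total variance kept:", eigvals[-1] / eigvals.sum())
```

Neither original feature could be dropped without losing about half of the total variance, but the extracted feature keeps most of it in a single column.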