# Data Preprocessing Report

## 1. Data Loading and Exploration

### Dataset: Startup Growth Investment Data

- **Source:** Provided dataset
- **Features:** `Startup Name`, `Industry`, `Investment Amount`, `Growth Rate`

### Data Types & Missing Values

```python
import pandas as pd

startup_df = pd.read_csv(r"C:\Users\lenovo\Downloads\archive\startup_growth_investment_data.csv")

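# Column dtypes and non-null counts
startup_df.info()

# Number of missing values per column (printed so it also shows when run as a script)
print(startup_df.isnull().sum())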
```

#### Observations:

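- The dataset contains both numerical columns (`Investment Amount`, `Growth Rate`) and categorical columns (`Startup Name`, `Industry`).
- Missing values must be handled before encoding and scaling, because the scikit-learn transformers used later do not accept missing categories or are distorted by gaps in the data.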
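
To make the second observation concrete, the sketch below summarizes missing values per column as both a count and a percentage. The frame `toy_df` and its values are hypothetical stand-ins for `startup_df`, used only so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical toy frame standing in for startup_df; values are illustrative only.
toy_df = pd.DataFrame({
    "Industry": ["Fintech", None, "Health", "Fintech"],
    "Investment Amount": [1.0e6, 2.5e6, None, None],
    "Growth Rate": [12.0, 8.5, 20.0, 15.0],
})

# Count and percentage of missing values per column.
missing_summary = pd.DataFrame({
    "missing_count": toy_df.isnull().sum(),
    "missing_pct": toy_df.isnull().mean() * 100,
})
print(missing_summary)
```

A percentage view like this can help decide whether imputation is reasonable or whether a column is too sparse to keep.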

---

## 2. Handling Missing Values

```python
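# Impute numerical columns with the median.
# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is chained assignment and may not update startup_df in recent pandas versions.
startup_df['Investment Amount'] = startup_df['Investment Amount'].fillna(startup_df['Investment Amount'].median())
startup_df['Growth Rate'] = startup_df['Growth Rate'].fillna(startup_df['Growth Rate'].median())

# Impute the categorical column with its most frequent value.
startup_df['Industry'] = startup_df['Industry'].fillna(startup_df['Industry'].mode()[0])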
```

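#### Explanation:
- **Median imputation** is used for numerical columns because the median is robust to outliers, so a few very large investments do not distort the filled-in values the way the mean would.
- **Mode imputation** is applied to categorical columns, replacing each missing entry with the most frequent category.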
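
The difference between the two fill strategies can be seen on a small example. This is a minimal sketch on made-up values (not taken from the dataset): one extreme value pulls the mean far from the typical range, while the median stays close to it.

```python
import pandas as pd

# Made-up values with one outlier and one missing entry.
amounts = pd.Series([1.0, 2.0, 3.0, None, 1000.0])

mean_fill = amounts.mean()      # (1 + 2 + 3 + 1000) / 4 = 251.5
median_fill = amounts.median()  # median of [1, 2, 3, 1000] = 2.5

# Fill the gap with the median, assigning the result to a new Series.
imputed = amounts.fillna(median_fill)

# Mode imputation for a categorical column: the most frequent label.
labels = pd.Series(["A", "B", "A", None])
mode_fill = labels.mode()[0]    # "A"
labels_imputed = labels.fillna(mode_fill)

print(mean_fill, median_fill, mode_fill)
```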

---

## 3. Encoding Categorical Variables

### Identified Categorical Columns:
- `Startup Name` (an identifier; excluded from the model features)
- `Industry` (Nominal)

```python
from sklearn.preprocessing import LabelEncoder

# Encode 'Industry' column
label_encoder = LabelEncoder()
startup_df['Industry'] = label_encoder.fit_transform(startup_df['Industry'])
```

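#### Explanation:
- **Label Encoding** maps each industry name to an integer code.
- Because `Industry` is nominal, these integer codes carry no meaningful order. Tree-based models are largely unaffected, but linear and distance-based models may interpret the codes as ordered magnitudes, in which case one-hot encoding is usually the safer choice.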
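
Since `Industry` is nominal, one-hot encoding is a common alternative that avoids implying an order between categories. The sketch below is not the approach used in this report; it uses `pd.get_dummies` on a hypothetical toy frame with made-up industry names.

```python
import pandas as pd

# Hypothetical toy frame; industry names and amounts are illustrative only.
toy_df = pd.DataFrame({
    "Industry": ["Fintech", "Health", "Fintech", "EdTech"],
    "Investment Amount": [1.0, 2.0, 3.0, 4.0],
})

# One binary indicator column per industry; no order is implied between them.
encoded_df = pd.get_dummies(toy_df, columns=["Industry"], prefix="Industry", dtype=int)
print(encoded_df)
```

The trade-off is width: one-hot encoding adds a column per category, which can become unwieldy when a column has many distinct values.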

---

## 4. Feature Scaling

- **Standardization** is applied to numerical columns.

```python
from sklearn.preprocessing import StandardScaler

# Standardize 'Investment Amount' and 'Growth Rate'
numeric_cols = ['Investment Amount', 'Growth Rate']
scaler = StandardScaler()
startup_df[numeric_cols] = scaler.fit_transform(startup_df[numeric_cols])
```

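#### Explanation:
- **Standardization** rescales each numerical column to zero mean and unit variance, which puts the columns on a comparable scale. This mainly benefits scale-sensitive models such as linear models with regularization, k-nearest neighbors, and models trained by gradient descent.
- Note that the scaler here is fitted on the full dataset, including the target `Growth Rate`, before the train-test split in Section 5. Fitting the scaler on the training rows only avoids leaking test-set statistics into training.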
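
As a quick check of the zero-mean, unit-variance property, the following sketch standardizes a small made-up column and inspects the result. `StandardScaler` uses the population standard deviation (`ddof=0`), which matches NumPy's default for `std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-column data; not taken from the dataset.
values = np.array([[10.0], [20.0], [30.0], [40.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(values)  # z = (x - mean) / std

print(scaled.mean(), scaled.std())

# The fitted scaler remembers the original statistics,
# so the transformation can be undone when needed.
restored = scaler.inverse_transform(scaled)
```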

---

## 5. Train-Test Split

```python
from sklearn.model_selection import train_test_split

# Define features and target
X = startup_df[['Industry', 'Investment Amount']]
y = startup_df['Growth Rate']

# Split the dataset into 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

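#### Explanation:
- The dataset is split into a training set (80%) and a test set (20%) so that the model is evaluated on rows it did not see during training.
- `random_state=42` fixes the shuffle, making the split reproducible across runs.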
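
One common way to keep the test set fully unseen is to split first and fit the scaler on the training rows only. The sketch below illustrates that ordering on randomly generated stand-in data (the frame `toy_df` and its values are hypothetical); it is a variation on the steps in this report, not the pipeline as originally written.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in data; not the real dataset.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "Investment Amount": rng.uniform(1e5, 1e7, size=50),
    "Growth Rate": rng.uniform(0, 100, size=50),
})

X = toy_df[["Investment Amount"]]
y = toy_df["Growth Rate"]  # the target is left unscaled

# Split first, so the test rows play no part in fitting the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows only, then reuse the same statistics on the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```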

---


