Exploratory Data Analysis (EDA) for a regression task in Python typically follows the steps below, which help you understand the relationships, distributions, and patterns in the data. The snippets assume `pandas` is imported as `pd`, `seaborn` as `sns`, and `matplotlib.pyplot` as `plt`, and that the data lives in a DataFrame named `df`. Below is a list of basic EDA techniques for regression problems:

### 1. **Loading and Inspecting Data**
   - **Read dataset:** `pd.read_csv()` (or other relevant function depending on the data format).
   - **Check shape of the dataset:** `df.shape`
   - **Display first few rows:** `df.head()`
   - **Check data types:** `df.info()`
   - **Check for missing values:** `df.isnull().sum()`
   - **Summary statistics:** `df.describe()`
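
The inspection calls above can be tried end to end on a small in-memory table. This is only a sketch: the CSV text and the column names (`area`, `rooms`, `city`, `price`) are made up so that the block runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for a real CSV file.
csv_text = """area,rooms,city,price
50,2,Lyon,150000
80,3,Paris,420000
65,,Lyon,210000
120,4,Paris,
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)              # (number of rows, number of columns)
print(df.head())             # first rows of the table
df.info()                    # dtypes and non-null counts (prints directly)
missing = df.isnull().sum()  # missing values per column
print(missing)
print(df.describe())         # summary statistics for numeric columns
```

With a real dataset, `io.StringIO(csv_text)` would simply be replaced by the path to the file.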

### 2. **Data Preprocessing**
   - **Handling missing values:** `df.fillna()` or `df.dropna()`
   - **Handling outliers:** Use visualization or statistical methods (e.g., z-score, IQR).
   - **Encoding categorical variables:** `pd.get_dummies()` for one-hot encoding.
   - **Feature scaling:** Standardize or normalize numerical features using `StandardScaler` or `MinMaxScaler`.
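
A minimal sketch of these four preprocessing steps on toy data follows. The column names and values are hypothetical, the median fill and the 1.5 × IQR rule are just one common choice each, and the scaling is written out by hand to show what `StandardScaler` computes (zero mean, unit population standard deviation).

```python
import pandas as pd

# Hypothetical toy data; names and values are for illustration only.
df = pd.DataFrame({
    "area": [50.0, 80.0, None, 65.0, 70.0, 900.0],
    "city": ["Lyon", "Paris", "Lyon", "Nice", "Paris", "Lyon"],
    "price": [150.0, 420.0, 210.0, 230.0, 390.0, 400.0],
})

# 1. Fill the missing numeric value with the column median.
df["area"] = df["area"].fillna(df["area"].median())

# 2. Flag outliers with the 1.5 * IQR rule and drop them.
q1, q3 = df["area"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["area"] < q1 - 1.5 * iqr) | (df["area"] > q3 + 1.5 * iqr)
df = df[~is_outlier].copy()

# 3. One-hot encode the categorical column.
df = pd.get_dummies(df, columns=["city"])

# 4. Standardize a numeric feature (population std, ddof=0, as StandardScaler does).
df["area_scaled"] = (df["area"] - df["area"].mean()) / df["area"].std(ddof=0)

print(df)
```

In a real modeling workflow the fill value and scaling statistics would be computed on the training split only and then reused on the test split, to avoid leaking information.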

### 3. **Univariate Analysis (Single Variable)**
   - **Histogram for numerical features:** 
     ```python
     df['column_name'].hist(bins=30)
     ```
   - **Boxplot to visualize distribution and outliers:**
     ```python
     sns.boxplot(x=df['column_name'])
     ```
   - **Count plot for categorical variables:**
     ```python
     sns.countplot(x='categorical_column', data=df)
     ```
   - **Descriptive statistics for numerical variables:**
     ```python
     df['column_name'].describe()
     ```

### 4. **Bivariate Analysis (Two Variables)**
   - **Scatter plot to visualize relationship between target and features:**
     ```python
     sns.scatterplot(x='feature_column', y='target_column', data=df)
     ```
   - **Correlation matrix for numerical features:**
     ```python
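     # Restrict to numeric columns; corr() raises on text columns in recent pandas versions
     correlation_matrix = df.select_dtypes(include='number').corr()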
     sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
     ```
   - **Pair plot to see pairwise relationships:**
     ```python
     sns.pairplot(df)
     ```
   - **Box plot between categorical features and target variable (if applicable):**
     ```python
     sns.boxplot(x='categorical_feature', y='target_column', data=df)
     ```
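
For a regression task, the column of the correlation matrix that belongs to the target is often the most useful part. The sketch below ranks features by the absolute value of their correlation with the target. The data is synthetic and the column names (`feature_a`, `feature_b`, `group`, `target`) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: the target depends on feature_a only.
rng = np.random.default_rng(0)
n = 200
feature_a = rng.normal(size=n)
feature_b = rng.normal(size=n)
df = pd.DataFrame({
    "feature_a": feature_a,
    "feature_b": feature_b,
    "group": rng.choice(["x", "y"], size=n),  # non-numeric column, excluded below
    "target": 3.0 * feature_a + rng.normal(scale=0.5, size=n),
})

# Pearson correlation of every numeric feature with the target.
numeric = df.select_dtypes(include="number")
target_corr = numeric.corr()["target"].drop("target")

# Sort by absolute strength while keeping the sign of each correlation.
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

Pearson correlation only captures linear association, so a feature with a low score can still matter through a nonlinear relationship; the scatter plots above remain the better check for that.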

### 5. **Multivariate Analysis (Multiple Variables)**
   - **3D scatter plot (two continuous features plotted against the target):**
     ```python
     import matplotlib.pyplot as plt

     # Matplotlib 3.2+ registers the '3d' projection itself, so no mpl_toolkits import is needed
     fig = plt.figure()
     ax = fig.add_subplot(111, projection='3d')
     ax.scatter(df['feature1'], df['feature2'], df['target'])
     plt.show()
     ```
   - **Heatmap of the correlation matrix** (as shown in Bivariate Analysis but for a larger set of variables).
   - **Principal Component Analysis (PCA) to reduce dimensionality and visualize the data:**
     ```python
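     from sklearn.decomposition import PCA
     from sklearn.preprocessing import StandardScaler

     # PCA is sensitive to feature scale, so standardize the features first
     scaled = StandardScaler().fit_transform(df[['feature1', 'feature2', 'feature3']])
     principal_components = PCA(n_components=2).fit_transform(scaled)
     pca_df = pd.DataFrame(data=principal_components, columns=['PCA1', 'PCA2'])
     sns.scatterplot(x='PCA1', y='PCA2', data=pca_df)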
     ```

### 6. **Visualizing Relationships and Checking Assumptions**
   - **Linear regression plot (to visualize linearity assumption):**
     ```python
     sns.regplot(x='feature_column', y='target_column', data=df)
     ```
   - **Residual plot (check homoscedasticity assumption):**
     ```python
     sns.residplot(x='feature_column', y='target_column', data=df)
     ```
   - **Histogram of residuals (assumes `model` is an already fitted regressor and `X`, `y` are the data it was fitted on):**
     ```python
     residuals = y - model.predict(X)
     sns.histplot(residuals, kde=True)
     ```
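
The residual snippets above depend on a fitted model. As a self-contained sketch, the block below fits a straight line with `np.polyfit` (swapped in here for a scikit-learn model to keep the example dependency-light) on synthetic data and computes the residuals numerically. The comparison of residual spread between the lower and upper half of `x` is only a rough, informal stand-in for reading a residual plot.

```python
import numpy as np

# Synthetic data: a linear trend plus constant-variance noise.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# Ordinary least-squares fit of a degree-1 polynomial (a straight line).
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept
residuals = y - predicted

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"mean residual={residuals.mean():.2e}")  # close to zero for a fit with an intercept

# Informal homoscedasticity check: residual spread for small x vs. large x.
half = x.size // 2
spread_low = residuals[:half].std()
spread_high = residuals[half:].std()
print(f"residual std (low x)={spread_low:.2f}, (high x)={spread_high:.2f}")
```

If the two spreads differ substantially, or the residual plot shows a funnel or curve, the constant-variance or linearity assumption is questionable.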

### 7. **Multicollinearity Check**
   - **Variance Inflation Factor (VIF):**
     ```python
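     from statsmodels.stats.outliers_influence import variance_inflation_factor
     from statsmodels.tools.tools import add_constant

     # Numeric predictors only (no target), plus a constant so each VIF includes an intercept
     X = add_constant(df.select_dtypes(include='number').drop(columns=['target_column']))
     vif_data = pd.DataFrame({
         "Feature": X.columns,
         "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
     })
     print(vif_data[vif_data["Feature"] != "const"])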
     ```
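
To make the number less of a black box: the VIF of feature j is 1 / (1 − R²ⱼ), where R²ⱼ is the R² obtained by regressing feature j on all the other features. The sketch below computes this directly with NumPy on synthetic data. The helper name `vif` is hypothetical and is not part of statsmodels.

```python
import numpy as np


def vif(X):
    """Return the VIF of each column of a 2D feature matrix X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    result = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns, with an intercept.
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        result[j] = 1.0 / (1.0 - r_squared)
    return result


rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # nearly a copy of x1
x3 = rng.normal(size=500)                  # independent of the others
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity; here the two nearly identical columns should stand out clearly while the independent one stays near 1.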

### 8. **Checking Normality**
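   - **Distribution of the target variable (a strongly skewed target may benefit from a transform; the normality assumption of linear regression itself applies to the residuals):**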
     ```python
     sns.histplot(df['target_column'], kde=True)
     ```
   - **Q-Q plot (Quantile-Quantile plot) for normality:**
     ```python
     import matplotlib.pyplot as plt
     import scipy.stats as stats
     stats.probplot(df['target_column'], dist="norm", plot=plt)
     plt.show()
     ```
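
As a numeric complement to these plots, sample skewness gives a quick reading of asymmetry. The sketch below uses a synthetic, right-skewed target (lognormal, an assumption chosen for illustration) and compares its skewness before and after a log transform.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed target; strictly positive, so np.log is safe here.
rng = np.random.default_rng(0)
target = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=500), name="target_column")

raw_skew = target.skew()          # clearly positive for a right-skewed variable
log_skew = np.log(target).skew()  # near zero if the log is roughly symmetric

print(f"skewness before log: {raw_skew:.2f}")
print(f"skewness after log:  {log_skew:.2f}")
```

When the target contains zeros, `np.log1p` is a common alternative to `np.log`. Predictions made on a transformed target have to be transformed back before computing metrics in the original units.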

### 9. **Handling Categorical Data (if applicable)**
   - **Bar plot for categorical vs target:**
     ```python
     sns.barplot(x='categorical_feature', y='target_column', data=df)
     ```
   - **One-hot encoding:**
     ```python
     df = pd.get_dummies(df, columns=['categorical_feature'])
     ```

---

These techniques will give you a good understanding of your dataset and its suitability for regression modeling. After performing these EDA steps, you can make informed decisions about preprocessing, feature engineering, and which model to apply.

## Regression Metrics

1. **Mean Absolute Error (MAE)**  
   - Measures the average of absolute differences between predicted and actual values.  
   - Useful when you want a simple metric in the target's units that weights every error in proportion to its size, which makes it less sensitive to outliers than MSE.  
   - Lower MAE indicates better predictive accuracy.

2. **Mean Squared Error (MSE)**  
   - Measures the average of squared differences between predicted and actual values.  
   - Preferred when larger errors are particularly undesirable.  
   - Lower MSE signifies better performance, with more penalty for large errors.

3. **Root Mean Squared Error (RMSE)**  
   - Square root of MSE, providing error in the original unit of measurement.  
   - Useful when you want to interpret the error in the same units as the data.  
   - Lower RMSE means better model performance, with an emphasis on large errors.

4. **R-squared (R²)**  
   - Represents the proportion of variance in the dependent variable explained by the model.  
   - Good for assessing overall model fit and how well the model captures variability.  
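   - Values closer to 1 indicate better fit; 0 means the model does no better than always predicting the mean, and the value can be negative on held-out data when the model does worse than that.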

5. **Adjusted R-squared**  
   - R² adjusted for the number of predictors, penalizing excessive variables.  
   - Use when comparing models with different numbers of predictors.  
   - Higher values indicate a better fit, adjusted for model complexity.

6. **Mean Absolute Percentage Error (MAPE)**  
   - Measures the average percentage difference between predicted and actual values.  
   - Useful when you need a scale-independent error metric that reads as a percentage.  
   - Lower MAPE indicates better model performance, but it is undefined when an actual value is zero and becomes very large when actual values are close to zero.

7. **Explained Variance Score**  
   - Indicates the proportion of the target's variance that the model's predictions account for.  
   - Similar to R², but it ignores any systematic offset (bias) in the errors; the two are equal when the mean error is zero.  
   - Values closer to 1 show better prediction and variance capture.
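
The definitions above can be checked by hand with plain NumPy. The sketch below uses four made-up values and an assumed single predictor (`n_features = 1`) for the adjusted R². In practice, `sklearn.metrics` provides `mean_absolute_error`, `mean_squared_error`, `r2_score`, `mean_absolute_percentage_error`, and `explained_variance_score` for the same quantities.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
n_features = 1  # hypothetical number of predictors in the model

errors = y_true - y_pred
n = y_true.size

mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)

ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

mape = np.mean(np.abs(errors / y_true))         # as a fraction; multiply by 100 for percent
explained_variance = 1.0 - np.var(errors) / np.var(y_true)

for name, value in [("MAE", mae), ("MSE", mse), ("RMSE", rmse), ("R2", r2),
                    ("Adjusted R2", adj_r2), ("MAPE", mape),
                    ("Explained variance", explained_variance)]:
    print(f"{name}: {value:.4f}")
```

In this example the predictions are biased (the mean error is not zero), so the explained variance score comes out higher than R²; that gap is exactly the bias that explained variance ignores.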

### The Goal
You want to reshape a **1D array** (like a Pandas Series or a 1D NumPy array) into a **2D array** with one column. The reason you're using `-1, 1` is to tell NumPy how to reshape the array in a flexible but specific way.

### The 1D Array
Suppose you have a 1D array, such as:
```python
import numpy as np

X_train = np.array([1, 2, 3, 4, 5])  # a plain Python list has no .reshape method
```

This is a 1D array with 5 elements. Its shape is `(5,)`: a single axis of length 5, with no notion of rows or columns yet.

### Why `reshape(-1, 1)`?

When you use `reshape(-1, 1)`, you are telling NumPy the following:

- **`-1`**: "I don't want to specify how many rows I want. Let NumPy figure that out for me based on the length of the array."
- **`1`**: "I want **1 column**. Each value in my original array should be placed into its own row, but all in one column."

#### What Happens?
Let's break it down with an example:

1. **Before reshaping**: 
   - Original array: `[1, 2, 3, 4, 5]`
   - Shape: `(5,)` (one axis holding 5 elements)

2. **Reshaped with `.reshape(-1, 1)`**:
   ```python
   X_train = X_train.reshape(-1, 1)  # for a pandas Series, use X_train.to_numpy().reshape(-1, 1)
   ```
   - New array:
     ```
     [[1],
      [2],
      [3],
      [4],
      [5]]
     ```
   - Shape: `(5, 1)` (5 rows, 1 column)

So `reshape(-1, 1)` **converts the 1D array into a 2D array** with exactly **1 column** and each value in its own row.
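
The before-and-after shapes can be confirmed directly. The sketch below does this for a NumPy array and for a pandas Series. The Series has no `reshape` method of its own, so it is converted with `to_numpy()` first, and `to_frame()` is shown as a pandas-native way to get a 2D object. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# NumPy array: reshape directly.
X_train = np.array([1, 2, 3, 4, 5])
X_col = X_train.reshape(-1, 1)
print(X_train.shape, "->", X_col.shape)

# pandas Series: convert to a NumPy array first, then reshape.
s = pd.Series([1, 2, 3, 4, 5], name="feature")
s_col = s.to_numpy().reshape(-1, 1)
print(s.shape, "->", s_col.shape)

# Alternative that stays in pandas: a one-column DataFrame is already 2D.
frame = s.to_frame()
print(frame.shape)
```

Selecting a column with double brackets, as in `df[['feature']]`, has the same effect as `to_frame()` and is often the simplest way to pass a single feature to scikit-learn.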

### Why Use `-1, 1`?
#### **1. Flexibility with Row Number (`-1`)**
The `-1` means "let NumPy decide the number of rows based on the original array's size." If your array has 10 elements, `reshape(-1, 1)` will give you a shape of `(10, 1)`. If it has 100 elements, it will reshape it to `(100, 1)`.

Without `-1`, you would have to pass the row count explicitly, for example `reshape(len(X_train), 1)`. `-1` makes this automatic.
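
A short sketch of how the `-1` inference behaves: NumPy divides the total number of elements by the dimensions you did specify, and raises an error when that division is not exact. The array sizes here are arbitrary.

```python
import numpy as np

# -1 is inferred from the total number of elements.
shapes = [np.arange(size).reshape(-1, 1).shape for size in (10, 100)]
print(shapes)

# The same inference works for other column counts: 10 elements give 5 rows of 2.
pairs = np.arange(10).reshape(-1, 2)
print(pairs.shape)

# If the element count is not divisible by the fixed dimension, NumPy raises.
try:
    np.arange(10).reshape(-1, 3)
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")
```

Only one dimension may be given as `-1` in a single `reshape` call, since NumPy can solve for at most one unknown size.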

#### **2. Standard Format for Machine Learning**
Many machine learning libraries, **scikit-learn** in particular, expect the feature matrix `X` as a **2D array** of shape `(n_samples, n_features)`, while the target `y` can stay 1D:

- **Rows**: Each row represents a **data point** (one sample).
- **Columns**: Each column represents a **feature** (one attribute of the sample).

For example, if you have 5 data points (samples) and each data point has 1 feature, you need a **5 x 1 array** (5 rows, 1 column).

### Summary:
- **`reshape(-1, 1)`** means "I want to reshape this array into a 2D array with as many rows as needed and 1 column."
- It’s a flexible way to ensure that the data is in the correct format for algorithms that expect 2D input.
