### Assignment-based Subjective Questions

In [1]:
#Q1) From your analysis of the categorical variables from the dataset, what could you infer about
# their effect on the dependent variable?

### ANSWER ####
'''
To analyze the effect of categorical variables on the dependent variable, we can break it down into a few key steps. Typically, this is done 
using exploratory data analysis (EDA) and statistical techniques like cross-tabulation, chi-square tests, and visualizations 
(such as bar charts or box plots) to determine relationships between the categorical variables and the dependent variable.

Here's a general approach to analyzing the effect of categorical variables:

### 1. Cross-Tabulation
   - Method: For each categorical variable, create a cross-tabulation or contingency table to observe how the values of the categorical 
    variables are distributed across the dependent variable categories.
   - Inference: This helps in identifying patterns or imbalances. For example, if you are looking at default status (dependent variable), you can see if certain categories (e.g., "employment type") have higher rates of default.

### 2. Chi-Square Test of Independence
   - Method: Perform a chi-square test to assess if there is a statistically significant relationship between a categorical variable and the dependent variable.
   - Inference: If the test shows a significant result (p-value < 0.05), the categorical variable and the dependent variable are statistically associated. This is evidence of a relationship, not proof of a causal effect. For example, loan type could be significantly associated with the likelihood of default.

### 3. Bar Charts and Count Plots
   - Method: Use bar charts or count plots to visualize the distribution of the dependent variable across different categories.
   - Inference: Visualizing the distribution can give a clear idea of the relationship. For instance, if a particular category (e.g., "marital status") leads to a significantly higher proportion of defaults, it indicates that the category may be a key driver of the dependent variable.

### 4. Categorical Encoding (One-Hot or Label Encoding)
   - Method: Encode categorical variables and then use logistic regression or decision trees to assess the effect of these encoded variables on the dependent variable.
   - Inference: The importance of each categorical feature can be gauged from the model’s coefficients (in regression) or feature importance scores (in decision trees or random forests).

### 5. Box Plots (for ordinal categories)
   - Method: For ordinal categorical variables, box plots can show how the dependent variable (if continuous) varies across different categories.
   - Inference: If the median or spread of the dependent variable shifts clearly from one category to the next, the category likely influences it. For example, a continuous outcome might rise steadily across income brackets.

### Sample Inference:
- If you have education level as a categorical variable and loan default as the dependent variable, you might find that people with higher education levels tend to default less. This would indicate that a higher education level is associated with a lower default rate.

The exact inference depends on your dataset and the specific variables in question, but these are general approaches used to infer the effect of categorical variables on a dependent variable.
'''

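The cross-tabulation and chi-square steps described above can be sketched as follows. This is a minimal illustration on made-up data: the column names `employment_type` and `default` and all counts are hypothetical, chosen only to mirror the loan-default example in the answer, and `scipy` is assumed to be installed.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: 50 salaried and 50 self-employed borrowers.
toy = pd.DataFrame({
    "employment_type": ["salaried"] * 50 + ["self_employed"] * 50,
    "default": [0] * 45 + [1] * 5 + [0] * 30 + [1] * 20,
})

# Step 1: contingency table of counts (rows = category, columns = outcome).
table = pd.crosstab(toy["employment_type"], toy["default"])
print(table)

# Row-normalised version: share of each outcome within each category.
rates = pd.crosstab(toy["employment_type"], toy["default"], normalize="index")
print(rates)

# Step 2: chi-square test of independence on the count table.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
```

A small p-value here only says the two variables are associated in this toy sample. The same two calls can be pointed at any categorical column and a categorical target.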

In [3]:
#Q2) Why is it important to use drop_first=True during dummy variable creation?
### ANSWER ####

'''
Using `drop_first=True` in `get_dummies()` is important in dummy variable creation because it prevents perfect multicollinearity 
(the "dummy variable trap") when fitting a regression model that includes an intercept. Here's why:

1. Multicollinearity:
   - When you have multiple categories for a categorical variable (e.g., "red," "blue," "green" for color), `get_dummies()` will create a 
   separate dummy variable for each category.
   - If all categories are represented by dummy variables, one of the dummies can always be perfectly predicted by the others. This results 
   in perfect multicollinearity, where one variable is a linear combination of others.
   - Multicollinearity can distort statistical tests and make it difficult to interpret the coefficients in regression models.

2. Redundant Information:
   - When you include all dummy variables, one is redundant. For example, if a categorical variable has three levels (A, B, C), 
   creating three dummies means if you know two of the dummy values, you can infer the third.
   - Example: If a variable takes the values A, B, or C, and you create three dummy variables:
     - A: [1, 0, 0]
     - B: [0, 1, 0]
     - C: [0, 0, 1]
   - The third column can be inferred if you have the first two, leading to redundancy.

3. drop_first=True:
   - When `drop_first=True`, pandas drops the dummy for the first category level and creates only (k-1) dummy variables for k categories.
   - This removes the redundancy and solves the multicollinearity issue.
   - The dropped category is treated as a reference category, and the remaining dummies represent how the other categories differ 
   from that reference.
'''

import pandas as pd

# Example DataFrame
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'green']
})

# Without drop_first
print(pd.get_dummies(df))

# With drop_first
print(pd.get_dummies(df, drop_first=True))

#Output without `drop_first=True`:
'''
|   | color_blue | color_green | color_red |
|---|------------|-------------|-----------|
| 0 | 0          | 0           | 1         |
| 1 | 1          | 0           | 0         |
| 2 | 0          | 1           | 0         |
| 3 | 1          | 0           | 0         |
| 4 | 0          | 1           | 0         |

#Output with `drop_first=True`:

|   | color_green | color_red |
|---|-------------|-----------|
| 0 | 0           | 1         |
| 1 | 0           | 0         |
| 2 | 1           | 0         |
| 3 | 0           | 0         |
| 4 | 1           | 0         |

 In this case, "blue" (the first category in sorted order) is the reference category, and the remaining dummies represent how "green" and "red" differ from "blue."
 (Recent pandas versions print these columns as True/False; 0/1 is used in these tables for readability.)

 ### Conclusion:
 Using `drop_first=True` simplifies your model by avoiding redundant information and prevents multicollinearity, improving model 
 interpretability and efficiency.

'''

   color_blue  color_green  color_red
0       False        False       True
1        True        False      False
2       False         True      False
3        True        False      False
4       False         True      False
   color_green  color_red
0        False       True
1        False      False
2         True      False
3        False      False
4         True      False


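To make the multicollinearity argument concrete, the sketch below checks the rank of the design matrix with and without `drop_first=True`. It reuses the same toy `color` column; the intercept column and the helper name `with_intercept` are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

colors = pd.DataFrame({"color": ["red", "blue", "green", "blue", "green"]})

# dtype=int gives 0/1 columns instead of booleans.
full = pd.get_dummies(colors, dtype=int)
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)

def with_intercept(dummies):
    """Prepend a column of ones, as a regression with an intercept would."""
    return np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])

X_full = with_intercept(full)        # intercept + 3 dummies = 4 columns
X_reduced = with_intercept(reduced)  # intercept + 2 dummies = 3 columns

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("full:   ", X_full.shape[1], "columns, rank", rank_full)
print("reduced:", X_reduced.shape[1], "columns, rank", rank_reduced)
```

With all three dummies the matrix has more columns than its rank, because the dummies sum to the intercept column. After dropping one level the rank equals the number of columns, so the coefficients are uniquely determined.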

In [4]:
# 3. Looking at the pair-plot among the numerical variables, which one has the highest correlation 
#with the target variable? 
# ANSWER
'''
 Based on the pair-plot of the numerical variables, the relationship between each variable and the target can be observed visually. The target 
 variable is "cnt" (the count being predicted). From the scatter plots:

- The variable "temp" (temperature) shows the strongest positive linear relationship with the target "cnt." This can be seen from the 
diagonal pattern in the scatter plot between "cnt" and "temp." 
- Similarly, the variable "atemp" (which may represent apparent temperature) also shows a strong correlation with "cnt."

Among these two, "temp" appears to have the strongest correlation with the target, based on visual inspection.
'''

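A visual reading of the pair-plot can be cross-checked numerically by ranking the correlation of each numerical column with the target. The sketch below uses synthetic data, not the actual dataset: the column names follow the answer above ("hum" and "windspeed" are assumed extra columns), and the values are generated only to illustrate the call pattern.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data; the real DataFrame would be loaded from file.
temp = rng.uniform(0, 1, n)
demo = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(0, 0.05, n),   # nearly a copy of temp
    "hum": rng.uniform(0, 1, n),              # unrelated to cnt here
    "windspeed": rng.uniform(0, 1, n),        # unrelated to cnt here
    "cnt": 5000 * temp + rng.normal(0, 500, n),
})

# Pearson correlation of every numerical column with the target, ranked.
corr_with_cnt = (
    demo.corr()["cnt"]
    .drop("cnt")
    .sort_values(ascending=False)
)
print(corr_with_cnt)
```

On the real data, replacing `demo` with the loaded DataFrame gives the ranking directly, which avoids relying on visual inspection to separate two closely related variables such as "temp" and "atemp".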

In [None]:
# Q5) How did you validate the assumptions of Linear Regression after building the model on the 
# training set?

### ANSWER 
'''
After building a Linear Regression model, it's crucial to validate the assumptions of the model to ensure its accuracy and generalizability. The following are the key assumptions of Linear Regression and common techniques to validate them:

### 1. Linearity of the relationship between features and target:
   - Assumption: The dependent variable (target) should have a linear relationship with each independent variable (features).
   - How to validate:
     - Residual Plot: Plot residuals (difference between observed and predicted values) versus the predicted values. The residuals should be randomly scattered around zero, without any distinct patterns (e.g., curved or funnel-shaped).
     - Scatter Plot: Plot the features against the target variable to visually check for linear relationships.
     - Partial Regression Plots: These help to visualize the effect of each predictor on the target while keeping other variables constant.
   
   Corrective Action: Apply transformations (e.g., log, polynomial features) if the relationships are non-linear.

### 2. Homoscedasticity (constant variance of errors):
   - Assumption: The variance of the residuals should remain constant across all levels of predicted values.
   - How to validate:
     - Residual Plot: Look for patterns in the residual plot. If the residuals show a "funnel" shape (i.e., the variance increases or decreases as the predicted values increase), it indicates heteroscedasticity.
   
   Corrective Action: Apply transformations to the dependent variable (e.g., log transformation) or use models that can handle heteroscedasticity (e.g., Generalized Least Squares).

### 3. Independence of errors:
   - Assumption: The residuals (errors) should be independent of each other, meaning there is no autocorrelation.
   - How to validate:
     - Durbin-Watson Test: This statistical test detects the presence of autocorrelation in the residuals. A value close to 2 indicates no autocorrelation, while values close to 0 or 4 suggest positive or negative autocorrelation, respectively.
     - Plot Residuals over Time: If your data is time-based (e.g., time series), plot the residuals over time to detect any patterns (which indicate dependence).
   
   Corrective Action: If autocorrelation is present, you might need to use time-series-specific techniques such as ARIMA models.

### 4. Normality of residuals:
   - Assumption: The residuals should be normally distributed.
   - How to validate:
     - Histogram or Q-Q Plot: Plot a histogram of the residuals or use a Q-Q plot (Quantile-Quantile plot) to check if the residuals follow a normal distribution. In a Q-Q plot, the points should fall along the 45-degree reference line if the residuals are normally distributed.
     - Shapiro-Wilk Test: A formal statistical test for normality. However, this test can be overly sensitive in large datasets, so visual inspections are often more practical.
   
   Corrective Action: If the residuals are not normally distributed, you may need to apply transformations to the target variable (e.g., log transformation).

### 5. No multicollinearity among independent variables:
   - Assumption: The independent variables should not be highly correlated with each other.
   - How to validate:
     - Variance Inflation Factor (VIF): Calculate VIF for each independent variable. A VIF > 10 suggests a high level of multicollinearity.
     - Correlation Matrix: Check the pairwise correlation among features. High correlation values (e.g., > 0.8 or < -0.8) between two or more features indicate multicollinearity.
   
   Corrective Action: Remove or combine highly correlated features, or use techniques like Principal Component Analysis (PCA) or Ridge Regression to handle multicollinearity.

### 6. Outliers and Influential Points:
   - Assumption: No single outlier or high-leverage point (influential observation) should unduly drive the fitted model, since such points can distort the model’s performance.
   - How to validate:
     - Leverage vs. Residuals Plot: Check for points that have high leverage and large residuals.
     - Cook’s Distance: This statistic identifies influential points. Values greater than 4/n (where n is the number of data points) are typically considered influential.
     - Boxplots: Use boxplots to visually inspect outliers in the independent variables.
   
   Corrective Action: Investigate and, if appropriate, remove or transform the outliers. Alternatively, you can use robust regression techniques that are less sensitive to outliers.

### 7. No Omitted Variable Bias:
   - Assumption: All relevant variables are included in the model.
   - How to validate:
     - Domain Knowledge: Ensure that you include all significant predictors based on your understanding of the problem.
     - Comparison of Models: Compare the performance of different models, with and without suspected omitted variables, to see if their inclusion significantly improves performance.

### Conclusion:
Validating these assumptions ensures that your Linear Regression model is reliable and generalizes well to unseen data. If any assumptions are violated, applying the appropriate transformations or using alternative models will help improve your model's robustness and accuracy.

'''
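Two of the checks listed above, the Durbin-Watson statistic and the Variance Inflation Factor, can be computed directly from their definitions. The sketch below does this with NumPy on synthetic data; the features `x1`, `x2`, `x3` and the helper names are hypothetical, and `x2` is deliberately built to be nearly collinear with `x1`. Libraries such as statsmodels provide equivalent functions; this version only shows what they compute.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Synthetic features: x2 is almost a copy of x1, x3 is independent.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 + 1.5 * x1 - 0.5 * x3 + rng.normal(scale=0.5, size=n)

def fit_ols(features, target):
    """Least-squares fit with an intercept; returns coefficients and residuals."""
    design = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta, target - design @ beta

def durbin_watson(resid):
    """Sum of squared successive differences over sum of squared residuals."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def vif(features, j):
    """VIF_j = 1 / (1 - R^2) from regressing feature j on the other features."""
    others = np.delete(features, j, axis=1)
    _, resid = fit_ols(others, features[:, j])
    r_squared = 1.0 - resid.var() / features[:, j].var()
    return 1.0 / (1.0 - r_squared)

_, residuals = fit_ols(X, y)
dw = durbin_watson(residuals)
vifs = [vif(X, j) for j in range(X.shape[1])]

print("mean residual:", residuals.mean())   # close to 0 when an intercept is fitted
print("Durbin-Watson:", dw)                 # near 2 suggests no autocorrelation
print("VIFs:", vifs)                        # large for x1 and x2, near 1 for x3
```

In this toy setup the collinear pair shows very large VIFs while the independent feature stays near 1, which is the pattern the VIF > 10 rule of thumb is meant to flag.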

In [5]:
# Based on the final model, which are the top 3 features contributing significantly towards 
# explaining the demand of the shared bikes?

# ANSWER 
'''
The variables year, season/weather situation, and month are significant in predicting the demand for shared bikes.
'''
