# Week 4 - Correlation and Regression

## Model the data

Once you've prepared the data, your next task is to analyze it to get insights that are not immediately obvious.

In this section, you'll learn how to:

- Calculate correlations, regressions, forecasts, and outliers using **spreadsheets**.
- Aggregate and pivot data using **Python** and **databases**.

**NOTE:** An earlier version of this module covered machine learning techniques such as classification and clustering. These have been removed from this version.

## Correlation with Excel

You'll learn to calculate and interpret correlations using Excel, covering:

- **Enabling the Analysis ToolPak:** Steps to enable Excel's Analysis ToolPak add-in.
- **Correlation Analysis:** Understanding statistical association between variables.
- **Creating a Correlation Matrix:** Steps to generate and interpret a correlation matrix.
- **Scatterplots and Trendlines:** Plotting data and adding trendlines to visualize correlations.
- **Analyzing Results:** Comparing correlation coefficients and understanding their implications.
- **Insights and Further Analysis:** Interpreting scatterplots and planning further analysis for deeper insights.
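
To make the idea of a correlation matrix concrete outside Excel, here is a minimal Python sketch. The column names and values are hypothetical toy data, not the course dataset, and `pearson_r` is a helper written for this illustration that applies the standard Pearson formula.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy columns (assumption: not taken from the course workbook).
columns = {
    "new_cases": [100, 200, 300, 400, 500],
    "new_deaths": [2, 4, 6, 8, 10],      # exactly proportional to new_cases
    "stringency": [80, 70, 65, 50, 40],  # falls as new_cases rises
}

# Correlation matrix: one coefficient for every pair of columns.
names = list(columns)
matrix = {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}

for a in names:
    print(a.ljust(12), "  ".join(f"{matrix[a][b]:+.2f}" for b in names))
```

Values near +1 or -1 indicate a strong linear association, and values near 0 indicate a weak one. The diagonal is always 1 because each column is perfectly correlated with itself.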

Here are the links used in the video:

- [Understand correlation](https://www.khanacademy.org/math/ap-statistics/bivariate-data-ap/correlation-coefficient-r/v/correlation-coefficient-intuition-examples)
- [COVID-19 vaccinations data explorer - Website](https://ourworldindata.org/covid-vaccinations?country=OWID_WRL)
- [COVID-19 vaccinations - Correlations Excel file](./week4_downloads/Correlation_2.xlsx)

## Regression with Excel

You'll learn to perform regression analysis using Excel, covering:

- **Data Preparation:** Understanding the cleaned dataset and necessary columns for analysis.
- **Enabling the Tool:** How to enable the Analysis ToolPak in Excel.
- **Types of Regression:** Differences between simple and multiple linear regression.
- **Setting Up Regression:** Steps to input dependent (new deaths) and independent variables (new cases, new tests, new vaccinations, stringency index) for the analysis.
- **Interpreting Output:** Reading the regression output, focusing on adjusted R-squared, significance value (F-test), and P-values.
- **Coefficient Interpretation:** Understanding the impact of each independent variable on the dependent variable, including scaling factors (per 1000 units).
- **Model Evaluation:** Evaluating the model based on significance values and understanding the implications of unexpected results (e.g., stringency index).
- **Further Analysis:** Recognizing the need for additional analysis when encountering unexpected or inconclusive results.
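
As a rough Python counterpart to the regression output discussed above, the sketch below fits a multiple linear regression with NumPy and computes R-squared and adjusted R-squared. The data is a hypothetical toy example with two predictors, not the COVID-19 dataset, and the sketch does not compute the F-test or p-values that Excel reports.

```python
import numpy as np

# Hypothetical toy data: y is roughly 3 + 2*x1 + 0.5*x2 with small noise.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
noise = np.array([0.1, -0.1, 0.05, -0.05, 0.0, 0.0])
y = 3.0 + 2.0 * x1 + 0.5 * x2 + noise

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Goodness of fit.
pred = X @ coef
ss_res = np.sum((y - pred) ** 2)      # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation
n = len(y)                            # number of observations
k = X.shape[1] - 1                    # number of predictors (excluding intercept)
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print("intercept, b1, b2:", np.round(coef, 3))
print("R-squared:", round(float(r2), 4))
print("Adjusted R-squared:", round(float(adj_r2), 4))
```

Adjusted R-squared penalizes each additional predictor, which is why it is the preferred figure when comparing models with different numbers of independent variables.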

Here are the links used in the video:

- [Understand regression](https://www.khanacademy.org/math/ap-statistics/bivariate-data-ap/least-squares-regression/v/calculating-the-equation-of-a-regression-line)
- [COVID-19 vaccinations - Regression Excel file](./week4_downloads/Regression.xlsx)
- [COVID-19 vaccinations - Regression Model 2 Excel file](./week4_downloads/Regression_3.xlsx)

## Forecasting with Excel

Here are the links used in the video:

- [FORECAST reference](https://support.microsoft.com/en-us/office/forecast-and-forecast-linear-functions-50ca49c9-7b40-4892-94e4-7ad38bbeda99)
- [FORECAST.ETS reference](https://support.microsoft.com/en-us/office/forecast-ets-forecasting-technique-9d6b3e4b-6b6c-4d4d-8b7e-9e7a9b5b6a0e)
- [Height-weight dataset](./week4_downloads/height-weight.xlsx) from [Kaggle](https://www.kaggle.com/datasets/burnoutminer/heights-and-weights-dataset)
- [Traffic dataset](./week4_downloads/traffic.xlsx) from [Kaggle](https://www.kaggle.com/datasets/fedesoriano/traffic-prediction-dataset)
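
For intuition, Excel's linear `FORECAST` function fits a least-squares line through the known points and evaluates it at a new x. The sketch below mirrors that calculation in plain Python. The function name `forecast_linear` and the height and weight values are hypothetical, chosen only for illustration, and the exponential smoothing used by `FORECAST.ETS` is not covered here.

```python
def forecast_linear(x, known_ys, known_xs):
    """Predict y at x from a least-squares line through the known points.

    The argument order follows Excel's FORECAST(x, known_y's, known_x's).
    """
    n = len(known_xs)
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    # Slope = sum of products of deviations / sum of squared x deviations.
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(known_xs, known_ys))
    denominator = sum((xi - mean_x) ** 2 for xi in known_xs)
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return intercept + slope * x


# Hypothetical toy data (assumption: not the Kaggle height-weight dataset).
heights_cm = [150, 160, 170, 180]
weights_kg = [50, 57, 64, 71]

predicted = forecast_linear(175, weights_kg, heights_cm)
print(f"Predicted weight at 175 cm: {predicted:.1f} kg")
```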

## Outlier detection with Excel

You'll learn how to identify and handle outliers in data using Excel, covering:

- **Understanding Outliers:** Definition of outliers and their impact on statistical analysis.
- **Calculating Quartiles:** Using Excel formulas to calculate Q1 (first quartile) and Q3 (third quartile).
- **Interquartile Range (IQR):** Finding the IQR by subtracting Q1 from Q3.
- **Determining Bounds:** Calculating lower and upper bounds using 1.5 times the IQR.
- **Identifying Outliers:** Using Excel functions to determine if data points fall outside the calculated bounds.
- **Visualizing Data:** Creating box plots to visualize outliers and data distribution.
- **Handling Outliers:** Deciding whether to exclude or keep outliers based on their impact on analysis.
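
The quartile, IQR, and bounds steps above can be sketched in Python with the standard library. The values are hypothetical toy data. The `method="inclusive"` option is used on the assumption that the Excel workbook uses `QUARTILE.INC`-style quartiles; other quartile conventions give slightly different bounds.

```python
from statistics import quantiles

# Hypothetical toy data with one suspiciously large value.
values = [12, 15, 14, 10, 18, 16, 13, 95]

# Quartile cut points; "inclusive" treats the data as covering min to max.
q1, _median, q3 = quantiles(values, n=4, method="inclusive")

# Interquartile range and the 1.5 * IQR fences.
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Any value outside the fences is flagged as an outlier.
outliers = [v for v in values if v < lower_bound or v > upper_bound]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Bounds: [{lower_bound}, {upper_bound}]")
print("Outliers:", outliers)
```

Flagging a value is only the first step. Whether to exclude it still depends on whether it is a data error or a genuine extreme observation.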

Here are the links used in the video:

- [Understand distributions and outliers](https://www.khanacademy.org/math/ap-statistics/quantitative-data-ap/xfb5d8e68:describing-distribution-quant/v/classifying-distributions-and-outliers)
- [COVID-19 vaccinations data - Excel](./week4_downloads/correlation2.xlsx)

## Data analysis with Python

The video is not available yet. Please review the notebook, which is self-explanatory.

You'll learn practical data analysis techniques in Python using Pandas, covering:

- **Reading Parquet Files:** Utilize Pandas to read Parquet file formats for efficient data handling.
- **Dataframe Inspection:** Methods to preview and understand the structure of a dataset.
- **Pivot Tables:** Creating and interpreting pivot tables to summarize data.
- **Percentage Calculations:** Normalize pivot table values to percentages for better insights.
- **Correlation Analysis:** Calculate and interpret correlation between variables, including significance testing.
- **Statistical Significance:** Use statistical tests to determine the significance of observed correlations.
- **Datetime Handling:** Extract and manipulate date and time information from datetime columns.
- **Data Visualization:** Generate and customize heat maps to visualize data patterns effectively.
- **Leveraging AI:** Use ChatGPT to generate and refine analytical code, enhancing productivity and accuracy.
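
A few of the techniques above can be sketched with Pandas as follows. The DataFrame is hypothetical toy data built in memory, and its column names are assumptions. In practice the data would come from `pd.read_parquet(...)` on the card transactions file, whose actual schema is not shown here. Significance testing and heat maps are left out of this sketch.

```python
import pandas as pd

# Hypothetical toy transactions (assumption: not the real dataset's schema).
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 09:15", "2024-01-01 18:40", "2024-01-02 09:05",
        "2024-01-02 20:30", "2024-01-03 12:00", "2024-01-03 19:45",
    ]),
    "category": ["food", "travel", "food", "food", "travel", "food"],
    "channel": ["online", "online", "store", "store", "store", "online"],
    "amount": [20.0, 150.0, 35.0, 25.0, 90.0, 30.0],
})

# Dataframe inspection: preview rows and check column types.
print(df.head())
print(df.dtypes)

# Pivot table: total amount per category (rows) and channel (columns).
pivot = df.pivot_table(
    index="category", columns="channel", values="amount",
    aggfunc="sum", fill_value=0,
)

# Percentage calculation: each row rescaled so it sums to 100.
percent = pivot.div(pivot.sum(axis=1), axis=0) * 100
print(percent.round(1))

# Datetime handling: extract the hour and weekday name.
df["hour"] = df["timestamp"].dt.hour
df["weekday"] = df["timestamp"].dt.day_name()

# Correlation analysis: Pearson correlation between hour and amount.
r = df["hour"].corr(df["amount"])
print("Correlation between hour and amount:", round(r, 3))
```

Normalizing by row answers "how is each category split across channels?". Dividing by `pivot.sum(axis=0)` along `axis=1` instead would answer "how is each channel split across categories?".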

Here are the links used in the video:

- [Data analysis with Python - Notebook]()
- [Card transactions dataset (Parquet)](./week4_downloads/card-transactions.parquet)
- [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)
- [Python Pandas tutorials](https://www.youtube.com/playlist?list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS)

## Data analysis with databases

The video is not available yet. Please review the notebook, which is self-explanatory.

You'll learn how to perform data analysis using SQL (via Python), covering:

- **Database Connection:** How to connect to a MySQL database using SQLAlchemy and Pandas.
- **SQL Queries:** Execute SQL queries directly from a Python environment to retrieve and analyze data.
- **Counting Rows:** Use SQL to count the number of rows in a table.
- **User Activity Analysis:** Query and identify top users by post count.
- **Post Concentration:** Determine if a small percentage of users contribute the majority of posts using SQL aggregation.
- **Correlation Calculation:** Calculate the Pearson correlation coefficient between user attributes such as age and reputation.
- **Regression Analysis:** Compute the regression slope to understand the relationship between views and reputation.
- **Handling Large Data:** Perform calculations on large datasets by fetching aggregated values from the database rather than entire datasets.
- **Statistical Analysis in SQL:** Use SQL as a tool for statistical analysis, demonstrating its power beyond simple data retrieval.
- **Leveraging AI:** Use ChatGPT to generate SQL queries and Python code, enhancing productivity and accuracy.
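
The idea of fetching aggregated values rather than entire datasets can be sketched as follows. To keep the example self-contained it uses Python's built-in `sqlite3` with an in-memory table as a stand-in for the MySQL database and SQLAlchemy connection used in the notebook. The `users` table, its columns, and its rows are hypothetical.

```python
import sqlite3
from math import sqrt

# In-memory SQLite database standing in for the real MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, views INTEGER, reputation INTEGER)"
)
# Hypothetical rows where reputation = 2 * views + 5.
conn.executemany(
    "INSERT INTO users (views, reputation) VALUES (?, ?)",
    [(10, 25), (20, 45), (30, 65), (40, 85), (50, 105)],
)

# Counting rows.
(n_rows,) = conn.execute("SELECT COUNT(*) FROM users").fetchone()

# Fetch only six aggregates; the raw rows never leave the database.
n, sx, sy, sxy, sxx, syy = conn.execute(
    """
    SELECT COUNT(*),
           SUM(views),
           SUM(reputation),
           SUM(views * reputation),
           SUM(views * views),
           SUM(reputation * reputation)
    FROM users
    """
).fetchone()
conn.close()

# Pearson correlation and regression slope computed from the aggregates.
cov_term = n * sxy - sx * sy
var_x_term = n * sxx - sx * sx
var_y_term = n * syy - sy * sy
r = cov_term / sqrt(var_x_term * var_y_term)
slope = cov_term / var_x_term  # change in reputation per additional view

print("rows:", n_rows)
print("Pearson r:", r)
print("slope:", slope)
```

Because only one row of sums crosses the network, the same query shape works on tables far too large to load into a Pandas DataFrame.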

Here are the links used in the video:

- [Data analysis with databases - Notebook](./data_analysis_with_databases.ipynb)
- [SQLZoo](https://www.sqlzoo.net/wiki/SQL_Tutorial) has simple interactive tutorials to learn SQL
- [Stats database](https://relational-data.org/dataset/Stats) that has an anonymized dump of [stats.stackexchange.com](https://stats.stackexchange.com/)
- [Pandas read_sql](https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html) 
- [SQLAlchemy docs](https://docs.sqlalchemy.org/en/20/)

## Optional: Visualizing Machine Learning

You'll learn about improving customer retention, understanding black box models, and using clustering for market segmentation:

- **Churn Reduction:** Use decision trees to identify customers likely to leave.
- **Cost Efficiency:** Compare customer acquisition vs. retention costs.
- **Model Improvement:** Apply SVMs and neural networks for better accuracy.
- **Project Challenges:** Understand issues with black box models in implementation.
- **K-Means Clustering:** Segment markets using demographic data.
- **Data Visualization:** Interpret clustering results using maps and charts.
- **Correlation Analysis:** Identify relationships between currency exchange rates.
- **Tool Proficiency:** Utilize Excel, Python, and JavaScript for analysis and communication.
- **Practical Application:** Tailor marketing strategies based on cluster characteristics.

Here are the links used in the video:

- [Visualizing-Forecast-Models.xlsx](./week4_downloads/Visualising-Forecast-Models.xlsx) - the spreadsheet used in the video