# Advanced Python Data Analysis and Visualization
    
This notebook demonstrates advanced techniques for data analysis and visualization using Python libraries such as `pandas`, `matplotlib`, and `seaborn`. Additionally, it generates a summary report in both Word and Excel formats.

### Dataset
We use the `tips` sample dataset that ships with `seaborn`. It records 244 restaurant bills, with the total bill, the tip, the payer's sex, whether the party included smokers, the day, the meal time, and the party size.


In [None]:

# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')
tips.head()



## Data Cleaning and Exploration
    
1. **Handling Missing Data**: Check for any missing values and handle them accordingly.
2. **Descriptive Statistics**: Provide summary statistics of the dataset.
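
The check in the next cell only counts missing values. As a minimal sketch of what "handling them accordingly" could look like, the example below fills gaps in a small made-up frame (the `toy` data and the median/mode strategy are assumptions for illustration, not something the notebook prescribes):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; values are made up for illustration.
toy = pd.DataFrame({
    "total_bill": [16.99, np.nan, 21.01, 23.68],
    "tip": [1.01, 1.66, np.nan, 3.31],
    "day": ["Sun", "Sun", None, "Sat"],
})

# Count missing values per column, mirroring the notebook's isnull().sum() check.
missing_before = toy.isnull().sum()

cleaned = toy.copy()

# Numeric columns: fill with the column median, which is robust to outliers.
numeric_cols = cleaned.select_dtypes(include="number").columns
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())

# Remaining columns: fill with the most frequent value.
for col in cleaned.columns.difference(numeric_cols):
    cleaned[col] = cleaned[col].fillna(cleaned[col].mode().iloc[0])

missing_after = cleaned.isnull().sum()
print(missing_before, missing_after, sep="\n\n")
```

Whether to fill, drop, or flag missing rows depends on the analysis; dropping with `dropna()` is often the safer choice when only a handful of rows are affected.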


In [None]:

# Check for missing data
missing_data = tips.isnull().sum()
missing_data


In [None]:

# Summary statistics
summary_stats = tips.describe()
summary_stats
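
By default `describe()` summarizes only the numeric columns of a mixed DataFrame, so categorical columns such as `day` are left out. The sketch below, on a small made-up frame, shows one way to summarize categorical columns as well (the `toy` data is an assumption for illustration):

```python
import pandas as pd

# Made-up frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "tip": [1.0, 2.0, 3.0, 4.0],
    "day": pd.Categorical(["Sat", "Sun", "Sun", "Sun"]),
})

# Default: numeric columns only (count, mean, std, min, quartiles, max).
numeric_summary = toy.describe()

# Categorical columns: count, number of unique values, most frequent value, its frequency.
categorical_summary = toy.describe(include="category")

print(numeric_summary)
print(categorical_summary)
```

The same `include="category"` call should work on `tips`, whose `sex`, `smoker`, `day`, and `time` columns are categorical.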


## Data Visualization

### 1. Correlation Heatmap
A heatmap to visualize the correlation between numeric features in the dataset.


In [None]:
# Reuse the 'tips' DataFrame loaded earlier and keep only the numeric columns,
# since correlation is defined only for numeric data
df = tips.select_dtypes(include='number')

# Display the first few rows to confirm which columns remain
print(df.head())


In [None]:
# 'df' holds the numeric columns of the 'tips' dataset selected in the previous cell

# 1. Correlation Heatmap
plt.figure(figsize=(10, 8))  # Set the figure size for the heatmap
correlation_matrix = df.corr()  # Calculate the correlation matrix

# Generate the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)

# Add title and adjust layout
plt.title('Correlation Heatmap')
plt.tight_layout()

# Show the plot
plt.show()


## Analysis of the Heatmap:

### Diagonal (1.0 values):
The diagonal values (from top-left to bottom-right) are all 1.0 because each variable is perfectly correlated with itself.

### `total_bill` and `tip` (0.68):
The Pearson correlation between `total_bill` and `tip` is 0.68, indicating a moderately strong positive linear relationship. This makes sense: as the total bill increases, tips generally increase as well.

### `total_bill` and `size` (0.60):
The correlation between `total_bill` and `size` is 0.60, also a positive correlation. Larger parties tend to have higher bills.

### `tip` and `size` (0.49):
The correlation between `tip` and `size` is 0.49, which is a moderate positive correlation. Larger parties tend to leave higher tips, although the relationship is weaker than the one between `total_bill` and `tip`.
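
To make the numbers in the heatmap less abstract, the sketch below computes a Pearson correlation by hand and compares it with `DataFrame.corr()`, which uses Pearson by default. The values in `toy` are made up for illustration and are not taken from the `tips` dataset:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
    "tip": [2.0, 3.0, 3.5, 6.0, 7.0],
})

x = toy["total_bill"].to_numpy()
y = toy["tip"].to_numpy()

# Pearson r: sum of products of deviations from the mean,
# divided by the square root of the product of the summed squared deviations.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same quantity for every pair of numeric columns.
r_pandas = toy.corr().loc["total_bill", "tip"]

print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```

A value near 1 means the points lie close to a rising straight line; it says nothing about how steep that line is.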


### 2. Boxplot of Tips by Day

A boxplot to show the distribution of tips based on the day of the week.


In [None]:

# Boxplot of tips by day
plt.figure(figsize=(8, 6))
sns.boxplot(x='day', y='tip', data=tips, palette='Set2')
plt.title('Boxplot of Tips by Day')
plt.show()


## Key Elements of the Boxplot:

### Boxes (Interquartile Range - IQR):
The central box represents the interquartile range (IQR), which is the range between the first quartile (25th percentile) and the third quartile (75th percentile).  
The line inside the box represents the **median** (50th percentile) tip for each day.

### Whiskers:
The "whiskers" extend from the box to the most extreme data points that still lie within 1.5 times the IQR of the quartiles (the matplotlib default).  
Any points beyond the whiskers are considered **outliers** and are plotted individually.

### Outliers:
The circles above the whiskers represent **outliers**: unusually high tips that lie more than 1.5 times the IQR above the third quartile.
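
The elements described above can be reproduced numerically. The following sketch computes the quartiles, the 1.5 × IQR fences, the whisker ends, and the outliers for a small made-up series (`tips_toy` is hypothetical data, not the real `tip` column):

```python
import pandas as pd

# Made-up tip amounts with one unusually large value.
tips_toy = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 10.0], name="tip")

q1 = tips_toy.quantile(0.25)   # first quartile (bottom of the box)
q3 = tips_toy.quantile(0.75)   # third quartile (top of the box)
iqr = q3 - q1                  # height of the box

# Fences: points beyond these limits are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme points that are still inside the fences.
inside = tips_toy[(tips_toy >= lower_fence) & (tips_toy <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

outliers = tips_toy[(tips_toy < lower_fence) | (tips_toy > upper_fence)]

print(f"Q1={q1}, median={tips_toy.median()}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers.tolist()}")
```

Note that the upper whisker ends at the largest non-outlier value, not at the fence itself.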

---

## Analysis by Day:

### Thursday (Thur):
- Median tip is around $2.
- The IQR is smaller compared to other days, meaning less variability in tips.
- A few high outliers are present.

### Friday (Fri):
- Median tip is around $2.5.
- The spread of the data (IQR) is slightly wider compared to Thursday.
- Similar to Thursday, there are a few outliers, but not as many as on Saturday.

### Saturday (Sat):
- Median tip is around $2.5, similar to Friday.
- There’s a larger number of **outliers** on Saturday, indicating that some customers left much higher tips on this day.
- The IQR range is similar to Friday.

### Sunday (Sun):
- Median tip is noticeably higher, around $3, indicating that tips are generally larger on Sunday compared to other days.
- The IQR is the largest on Sunday, suggesting the most variability in tip amounts.
- There are fewer outliers compared to Saturday, but tips generally reach higher amounts.
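
Medians and IQRs read off a plot are approximate. As a sketch of how they could be checked numerically, the example below computes both per day with `groupby`. The `toy` frame and the `iqr` helper are assumptions for illustration; the values are made up:

```python
import pandas as pd

# Made-up tips for two days; not taken from the real dataset.
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Sun", "Sun", "Sun"],
    "tip": [1.5, 2.0, 2.5, 2.0, 3.0, 5.0],
})

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    return values.quantile(0.75) - values.quantile(0.25)

# One row per day with the median tip and the IQR of tips.
per_day = toy.groupby("day")["tip"].agg(median="median", iqr=iqr)
print(per_day)
```

On the real data the same aggregation could be run as `tips.groupby("day", observed=True)["tip"].agg(median="median", iqr=iqr)`; `observed=True` is passed because `day` is a categorical column.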


### 3. Scatter Plot with Regression Line

A scatter plot to show the relationship between total bill and tip amount, with a regression line.


In [None]:

# Scatter plot with regression line
plt.figure(figsize=(8, 6))
sns.regplot(x='total_bill', y='tip', data=tips, scatter_kws={'s':20}, line_kws={'color':'red'})
plt.title('Total Bill vs Tip with Regression Line')
plt.show()


## Key Components of the Plot:

### Scatter Points:
- Each blue dot represents a single data point where the x-coordinate is the **total bill** and the y-coordinate is the **tip** for that transaction.
- The points are scattered along the chart, showing how the tip varies with different total bill amounts.

### Regression Line (Red Line):
- The red line represents a **linear regression fit** of the data, showing the general trend in the relationship between the total bill and the tip.
- The positive slope of the line indicates that there is a **positive relationship** between the total bill and the tip — as the total bill increases, the tip generally increases.

### Shaded Area (Confidence Interval):
- The shaded red area around the regression line represents the **confidence interval** for the regression estimate; by default `regplot` draws a 95% interval computed by bootstrapping.
- This band shows the uncertainty in the fitted line itself, not the spread of individual tips; the wider the band, the less precisely the data pin down the line at that bill amount.
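
The idea behind the shaded band can be sketched with a plain bootstrap: resample the data with replacement, refit the line each time, and take percentiles of the fitted values. This is an approximation of the concept under stated assumptions, not seaborn's exact implementation; the synthetic data (tips of roughly 15% of the bill plus noise) and all variable names are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, assumed for illustration: tip is about 15% of the bill plus noise.
total_bill = rng.uniform(5, 50, size=200)
tip = 0.15 * total_bill + rng.normal(0, 1.0, size=200)

# Least-squares line fitted to the full sample.
slope_full, intercept_full = np.polyfit(total_bill, tip, deg=1)

# Evaluate every bootstrap fit on a common grid of bill amounts.
grid = np.linspace(total_bill.min(), total_bill.max(), 50)
n_boot = 500
boot_lines = np.empty((n_boot, grid.size))

for i in range(n_boot):
    # Resample observation indices with replacement and refit the line.
    idx = rng.integers(0, total_bill.size, size=total_bill.size)
    slope, intercept = np.polyfit(total_bill[idx], tip[idx], deg=1)
    boot_lines[i] = slope * grid + intercept

# The 2.5th and 97.5th percentiles at each grid point form a 95% band.
band_low, band_high = np.percentile(boot_lines, [2.5, 97.5], axis=0)

print(f"fitted slope: {slope_full:.3f}")
print(f"band width at left edge: {band_high[0] - band_low[0]:.3f}")
print(f"band width at centre:    {band_high[25] - band_low[25]:.3f}")
```

In `regplot`, the interval level and the number of resamples are controlled by the `ci` and `n_boot` parameters.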

---

## Analysis:

- **Positive Correlation**: The plot clearly shows a positive correlation between the total bill and tip. As the total bill increases, tips also tend to increase, though the data shows a lot of variation.
  
- **Outliers**: There are a few data points where very high tips are given for certain bills, particularly noticeable around the highest total bill values (e.g., total bill of $50 and a tip of $10).

- **Spread of Data**: The vertical spread of the points grows with the bill. For smaller total bills (below $20) tips fall within a fairly narrow range, while for larger bills tips are higher on average but also vary more.


## Report Generation

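The following cell generates a summary report in both Word and Excel formats, including data statistics and visualizations. It requires the `python-docx` package for the Word output and an Excel engine such as `openpyxl` for the Excel output, and it writes all files to a `data/` directory.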


In [None]:
import os

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from docx import Document  # python-docx, for generating Word documents
from docx.shared import Inches  # Length helper for sizing pictures

# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')

# Make sure the output directory exists before anything is saved to it
os.makedirs("data", exist_ok=True)

# Recompute the statistics so this cell can run on its own
summary_stats = tips.describe()  # Summary statistics of the numeric columns
missing_data = tips.isnull().sum()  # Count of missing values in each column

# Function to generate a Word report
def generate_word_report(summary, missing_data, visualizations):
    doc = Document()

    # Title
    doc.add_heading('Data Analysis Report', 0)

    # Summary statistics
    doc.add_heading('Summary Statistics', level=1)
    doc.add_paragraph(str(summary))

    # Missing data
    doc.add_heading('Missing Data', level=1)
    doc.add_paragraph(str(missing_data))

    # Add visualizations; setting only the width preserves each image's aspect ratio
    doc.add_heading('Visualizations', level=1)
    for viz_path in visualizations:
        doc.add_picture(viz_path, width=Inches(5.5))

    # Save the Word document
    doc.save("data/data_analysis_report.docx")
    print("Word report generated successfully.")

# Function to generate an Excel report (needs an Excel engine such as openpyxl)
def generate_excel_report(df):
    with pd.ExcelWriter("data/data_analysis_report.xlsx") as writer:
        # Write the original data
        df.to_excel(writer, sheet_name='Data', index=False)

        # Write summary statistics
        df.describe().to_excel(writer, sheet_name='Summary Statistics')

        # Write missing data report, computed from the DataFrame passed in
        missing_data_df = df.isnull().sum().to_frame(name='Missing Data')
        missing_data_df.to_excel(writer, sheet_name='Missing Data')

    print("Excel report generated successfully.")

# Save visualizations as images
visualizations = []
plt.figure(figsize=(8, 6))
# Correlate the numeric columns only; 'tips' also contains categorical columns
sns.heatmap(tips.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap')
heatmap_path = "data/heatmap.png"
plt.savefig(heatmap_path)
plt.close()  # Release the figure once it has been written to disk
visualizations.append(heatmap_path)

plt.figure(figsize=(8, 6))
sns.boxplot(x='day', y='tip', data=tips, palette='Set2')
plt.title('Boxplot of Tips by Day')
boxplot_path = "data/boxplot.png"
plt.savefig(boxplot_path)
plt.close()
visualizations.append(boxplot_path)

plt.figure(figsize=(8, 6))
sns.regplot(x='total_bill', y='tip', data=tips, scatter_kws={'s':20}, line_kws={'color':'red'})
plt.title('Total Bill vs Tip with Regression Line')
regplot_path = "data/regplot.png"
plt.savefig(regplot_path)
plt.close()
visualizations.append(regplot_path)

# Generate Word and Excel reports
generate_word_report(summary_stats, missing_data, visualizations)
generate_excel_report(tips)
