This project performs basic data analysis and visualization on a sample dataset using Python libraries like pandas, numpy, matplotlib, and seaborn.
The project starts by loading a dataset (in this case, the 'tips' dataset from the seaborn library) into a pandas DataFrame.
print("\nMissing values per column:")
print(df.isnull().sum())
print("Descriptive Statistics for Numerical Columns:")
display(df.describe())
print("\nMean Total Bill and Tip by Day:")
display(df.groupby('day', observed=False)[['total_bill', 'tip']].mean())
print("\nMean Total Bill and Tip by Sex:")
display(df.groupby('sex', observed=False)[['total_bill', 'tip']].mean())
print("\nMean Total Bill and Tip by Smoker Status:")
display(df.groupby('smoker', observed=False)[['total_bill', 'tip']].mean())
import matplotlib.pyplot as plt
import numpy as np  # required by estimator=np.mean in the bar plot
import seaborn as sns
# Histogram of total bill with a kernel density estimate overlaid
plt.figure(figsize=(10, 6))
sns.histplot(df['total_bill'], kde=True)
plt.title('Distribution of Total Bill')
plt.xlabel('Total Bill')
plt.ylabel('Frequency')
plt.show()
# Bar chart of the mean tip for each day (error bars disabled)
plt.figure(figsize=(10, 6))
sns.barplot(data=df, x='day', y='tip', errorbar=None, estimator=np.mean, palette='viridis')
plt.title('Average Tip Amount by Day')
plt.xlabel('Day of the Week')
plt.ylabel('Average Tip')
plt.show()
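# Scatter plot of tip against total bill; marker size encodes table size, color encodes time
plt.figure(figsize=(12, 7))
sns.scatterplot(data=df, x='total_bill', y='tip', size='size', hue='time', sizes=(50, 500), alpha=0.6, palette='deep')
plt.title('Total Bill vs. Tip by Table Size and Time')
plt.xlabel('Total Bill')
plt.ylabel('Tip')
# The legend holds both the time (color) and size (marker) entries, so it is not titled 'Time' alone
plt.legend(loc='upper left')
plt.show()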
# Box plot comparing the total bill distribution at lunch and dinner
plt.figure(figsize=(8, 6))
sns.boxplot(data=df, x='time', y='total_bill', palette='pastel')
plt.title('Distribution of Total Bill by Time of Day')
plt.xlabel('Time of Day')
plt.ylabel('Total Bill')
plt.show()
Analyze and visualize a chosen CSV dataset using pandas and matplotlib/seaborn. The analysis should include loading, exploring, cleaning, basic statistical analysis, and creating at least four different types of visualizations (line chart, bar chart, histogram, scatter plot).
Select a CSV dataset and load it into a pandas DataFrame.
Reasoning: The current code block generates a synthetic dataset, which is not a CSV dataset as requested in the subtask. I need to choose a suitable CSV dataset and load it into a pandas DataFrame. I will use a well-known dataset available in the seaborn library for demonstration purposes, specifically the 'tips' dataset, as it contains both numerical and categorical data suitable for analysis and visualization. I will then load it into a DataFrame named df.
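The loading step can be sketched roughly as follows. In the project itself the data comes from seaborn's load_dataset('tips'), which downloads the file on first use and therefore needs network access. The version below reads from a CSV source with pandas instead, so it also works offline. The helper name load_csv_dataset, its categorical_cols parameter, and the three sample rows are assumptions made for illustration and do not appear in the original notebook.

```python
import io

import pandas as pd


def load_csv_dataset(source, categorical_cols=()):
    """Read a CSV (path, URL, or file-like object) into a DataFrame.

    Columns listed in categorical_cols are cast to pandas' category dtype,
    which is the kind of key the observed=False group-by calls are meant for.
    """
    df = pd.read_csv(source)
    for col in categorical_cols:
        df[col] = df[col].astype("category")
    return df


# Illustrative rows only, using the same column layout as 'tips'.
# Replace sample_csv with a real file path when working with an actual dataset.
sample_csv = io.StringIO(
    "total_bill,tip,sex,smoker,day,time,size\n"
    "16.99,1.01,Female,No,Sun,Dinner,2\n"
    "10.34,1.66,Male,No,Sun,Dinner,3\n"
    "21.01,3.50,Male,No,Sun,Dinner,3\n"
)
df = load_csv_dataset(sample_csv, categorical_cols=["sex", "smoker", "day", "time"])
print(df.shape)
print(df.dtypes)
```

With network access, df = sns.load_dataset('tips') would replace the last few lines and yield the full 244-row table.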
Display the head of the DataFrame and check the data types and missing values.
Reasoning: Display the head of the DataFrame and check the data types and missing values as instructed.
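A minimal sketch of this inspection step is shown below. It collects the first rows, the column dtypes, and the missing-value counts in one place. The function name inspect_dataframe and the small demo frame (with one deliberately missing tip) are hypothetical and only serve to make the example self-contained.

```python
import pandas as pd


def inspect_dataframe(df, n=5):
    """Return the first n rows, the column dtypes, and missing-value counts."""
    return {
        "head": df.head(n),
        "dtypes": df.dtypes,
        "missing": df.isnull().sum(),
    }


# Hypothetical values; the missing tip is there to show a non-zero count.
demo = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, None, 3.50],
    "day": pd.Categorical(["Sun", "Sun", "Sat"]),
})
report = inspect_dataframe(demo, n=2)
for name, part in report.items():
    print(f"\n{name}:")
    print(part)
```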
Handle missing values by either filling or dropping them.
Reasoning: Check for missing values in the DataFrame df to confirm the presence or absence of missing data as instructed.
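Because the 'tips' data turns out to have no missing values, this step is effectively a no-op here. For a dataset that does have gaps, filling or dropping might look like the sketch below. The helper handle_missing and its defaults (median for numeric columns, most frequent value otherwise) are assumptions chosen for illustration, not choices made in the original analysis.

```python
import pandas as pd


def handle_missing(df, strategy="fill"):
    """Return a copy of df with missing values filled or dropped.

    strategy="fill": numeric columns get their median, other columns
                     their most frequent value.
    strategy="drop": rows containing any missing value are removed.
    """
    if strategy == "drop":
        return df.dropna().reset_index(drop=True)
    if strategy != "fill":
        raise ValueError("strategy must be 'fill' or 'drop'")

    out = df.copy()
    for col in out.columns:
        if not out[col].isnull().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode()
            if not modes.empty:  # an all-missing column has no mode
                out[col] = out[col].fillna(modes.iloc[0])
    return out


# Hypothetical frame with one missing number and one missing label.
demo = pd.DataFrame({
    "total_bill": [10.0, None, 30.0],
    "day": ["Sat", "Sat", None],
})
filled = handle_missing(demo, "fill")
dropped = handle_missing(demo, "drop")
print(filled)
print(dropped)
```

Filling keeps every row at the cost of inventing values, while dropping keeps only fully observed rows; which one is appropriate depends on how much data is missing and why.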
Calculate descriptive statistics and perform group-by operations to find insights.
Reasoning: Calculate descriptive statistics for numerical columns and perform group-by operations as requested in the instructions.
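The group-by operations in this project repeat the same pattern for day, sex, and smoker status. One way to express that pattern once is sketched below. The helper mean_by_group and the four-row demo frame are hypothetical; observed=False is kept as in the project's own calls so that unused categories still appear in the result.

```python
import pandas as pd


def mean_by_group(df, group_cols, value_cols):
    """Return {group column: DataFrame of mean value_cols per group}."""
    return {
        col: df.groupby(col, observed=False)[value_cols].mean()
        for col in group_cols
    }


# Hypothetical values, chosen so the means are easy to verify by hand.
demo = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 3.0, 4.0, 6.0],
    "day": pd.Categorical(["Sat", "Sat", "Sun", "Sun"]),
    "smoker": pd.Categorical(["No", "Yes", "No", "Yes"]),
})
summaries = mean_by_group(demo, ["day", "smoker"], ["total_bill", "tip"])
for col, table in summaries.items():
    print(f"\nMean total_bill and tip by {col}:")
    print(table)
```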
Create four different types of visualizations to explore the data.
Reasoning: Create the requested visualizations to explore the data distribution and relationships between variables.
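The task statement lists a line chart among the required plot types, while the plotting code in this project covers a histogram, a bar chart, a scatter plot, and a box plot. A line chart could be added along the lines of the sketch below, for example the average tip for each table size. The function plot_mean_line and the demo values are assumptions for illustration only.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_mean_line(df, x_col, y_col, ax=None):
    """Draw a line chart of the mean of y_col for each value of x_col."""
    if ax is None:
        _, ax = plt.subplots(figsize=(10, 6))
    means = df.groupby(x_col)[y_col].mean().sort_index()
    ax.plot(means.index, means.values, marker="o")
    ax.set_title(f"Average {y_col} by {x_col}")
    ax.set_xlabel(x_col)
    ax.set_ylabel(f"Average {y_col}")
    return ax


# Hypothetical values; with the real data, pass the tips DataFrame instead.
demo = pd.DataFrame({
    "size": [2, 2, 3, 4, 4],
    "tip": [2.0, 3.0, 3.5, 4.0, 5.0],
})
ax = plot_mean_line(demo, "size", "tip")
# plt.show()  # uncomment when running interactively
```

A line chart suits an ordered x-axis such as table size; for unordered categories like day or sex, the bar chart remains the more natural choice.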
Add titles and labels to plots and interpret the findings.
Reasoning: Review the previously generated plots and their code to ensure titles and labels are present and informative. Then, provide interpretations for each plot. Since the plots were generated in the previous step and included titles and labels, the main task is to provide the interpretations.
- The dataset contains 244 entries with no missing values across any column.
- Numerical columns (total_bill, tip, size) show varying distributions, with total_bill being right-skewed.
- Grouping by day reveals slightly higher average total bills and tips on Saturday and Sunday compared to Thursday and Friday.
- Males, on average, have higher total bills and tips than females.
- Tables with smokers have slightly higher average total bills than non-smokers, but average tips are very similar between the two groups.
- There is a positive correlation between the total bill and the tip amount.
- Dinner bills tend to be higher and have a larger spread than lunch bills.
- The analysis suggests that weekend days (Saturday and Sunday) and the "Dinner" time slot are associated with higher spending and tipping. Further investigation into the factors driving these differences could be beneficial.
- Given the positive correlation between total bill and tip, exploring the tip percentage rather than the absolute tip amount might provide additional insights into tipping behavior.
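
The tip-percentage follow-up suggested in the last point could start from something like the sketch below, which also computes the Pearson correlation behind the bill-versus-tip finding. The column name tip_pct, the helper add_tip_percentage, and the demo values are hypothetical; the demo is constructed so that dinner has larger absolute tips but a lower tip percentage, which is the kind of difference the absolute amounts alone would hide.

```python
import pandas as pd


def add_tip_percentage(df):
    """Return a copy of df with tip_pct = 100 * tip / total_bill."""
    out = df.copy()
    out["tip_pct"] = 100 * out["tip"] / out["total_bill"]
    return out


# Hypothetical values, not taken from the 'tips' dataset.
demo = pd.DataFrame({
    "total_bill": [10.0, 20.0, 40.0, 50.0],
    "tip": [2.0, 3.0, 4.0, 5.0],
    "time": pd.Categorical(["Lunch", "Lunch", "Dinner", "Dinner"]),
})
with_pct = add_tip_percentage(demo)

# Mean tip percentage per time slot
pct_by_time = with_pct.groupby("time", observed=False)["tip_pct"].mean()
# Pearson correlation between bill and absolute tip
r = with_pct["total_bill"].corr(with_pct["tip"])

print(pct_by_time)
print(f"Correlation between total_bill and tip: {r:.3f}")
```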