Brayern/Week-7-Data-Analysis

Week-7-Data-Analysis

Data Analysis and Visualization Project

This project performs basic data analysis and visualization on a sample dataset using Python libraries like pandas, numpy, matplotlib, and seaborn.

1. Data Loading and Exploration

The project starts by loading a dataset (in this case, the 'tips' dataset from the seaborn library) into a pandas DataFrame.

Calculate and print the number of missing values for each column

print("\nMissing values per column:")
print(df.isnull().sum())

Calculate and display descriptive statistics for numerical columns

print("Descriptive Statistics for Numerical Columns:")
display(df.describe())

Group by 'day' and calculate mean of 'total_bill' and 'tip'

print("\nMean Total Bill and Tip by Day:")
display(df.groupby('day', observed=False)[['total_bill', 'tip']].mean())

Group by 'sex' and calculate mean of 'total_bill' and 'tip'

print("\nMean Total Bill and Tip by Sex:")
display(df.groupby('sex', observed=False)[['total_bill', 'tip']].mean())

Group by 'smoker' and calculate mean of 'total_bill' and 'tip'

print("\nMean Total Bill and Tip by Smoker Status:")
display(df.groupby('smoker', observed=False)[['total_bill', 'tip']].mean())

import matplotlib.pyplot as plt
import numpy as np  # needed for estimator=np.mean in the bar chart
import seaborn as sns

1. Create a histogram of the 'total_bill' column

plt.figure(figsize=(10, 6))
sns.histplot(df['total_bill'], kde=True)
plt.title('Distribution of Total Bill')
plt.xlabel('Total Bill')
plt.ylabel('Frequency')
plt.show()

2. Create a bar chart showing the average 'tip' amount for each 'day'

plt.figure(figsize=(10, 6))
sns.barplot(data=df, x='day', y='tip', errorbar=None, estimator=np.mean, palette='viridis')
plt.title('Average Tip Amount by Day')
plt.xlabel('Day of the Week')
plt.ylabel('Average Tip')
plt.show()

3. Create a scatter plot showing the relationship between 'total_bill' and 'tip'

plt.figure(figsize=(12, 7))
sns.scatterplot(data=df, x='total_bill', y='tip', size='size', hue='time',
                sizes=(50, 500), alpha=0.6, palette='deep')
plt.title('Total Bill vs. Tip by Table Size and Time')
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.legend(title='Time', loc='upper left')
plt.show()

4. Create a box plot of 'total_bill' by 'time'

plt.figure(figsize=(8, 6))
sns.boxplot(data=df, x='time', y='total_bill', palette='pastel')
plt.title('Distribution of Total Bill by Time of Day')
plt.xlabel('Time of Day')
plt.ylabel('Total Bill')
plt.show()

Task

Analyze and visualize a chosen CSV dataset using pandas and matplotlib/seaborn. The analysis should include loading, exploring, cleaning, basic statistical analysis, and creating at least four different types of visualizations (line chart, bar chart, histogram, scatter plot).

Choose and load data

Subtask:

Select a CSV dataset and load it into a pandas DataFrame.

Reasoning: The current code block generates a synthetic dataset, which is not a CSV dataset as requested in the subtask. I need to choose a suitable CSV dataset and load it into a pandas DataFrame. I will use a well-known dataset available in the seaborn library for demonstration purposes, specifically the 'tips' dataset, as it contains both numerical and categorical data suitable for analysis and visualization. I will then load it into a DataFrame named df.
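
A minimal sketch of the loading step. The project itself takes the data from seaborn (typically via `sns.load_dataset('tips')`, which downloads the file on first use); the version below instead reads a few hand-written rows in the same column layout with `pd.read_csv`, so it runs offline. The rows and the helper name `load_tips_like` are illustrative assumptions, not values or code from the project.

```python
import io

import pandas as pd

# Hypothetical sample rows in the tips column layout (not real data).
SAMPLE_CSV = """total_bill,tip,sex,smoker,day,time,size
10.00,1.50,Female,No,Sun,Dinner,2
20.00,3.00,Male,No,Sun,Dinner,3
15.00,2.00,Male,Yes,Sat,Dinner,2
12.00,2.00,Female,No,Thur,Lunch,2
"""


def load_tips_like(source):
    """Read a tips-style CSV and store the text columns as categoricals."""
    df = pd.read_csv(source)
    for col in ["sex", "smoker", "day", "time"]:
        df[col] = df[col].astype("category")
    return df


# `source` can be a file path or any file-like object.
df = load_tips_like(io.StringIO(SAMPLE_CSV))
print(df.shape)
```

Converting the text columns to categoricals mirrors how the seaborn copy of the dataset is typed, which is why the group-by calls in this project pass `observed=False`.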

Explore data

Subtask:

Display the head of the DataFrame and check the data types and missing values.

Reasoning: Display the head of the DataFrame and check the data types and missing values as instructed.
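
The exploration step might look like the sketch below. The three rows are made up for illustration, and one tip is left missing on purpose so the missing-value count has something to report.

```python
import pandas as pd

# Hypothetical rows with tips-style columns; one tip is deliberately missing.
df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 15.0],
    "tip": [1.5, None, 2.0],
    "day": ["Sun", "Sun", "Sat"],
})

print(df.head())           # first rows of the DataFrame
print(df.dtypes)           # data type of each column
missing = df.isnull().sum()
print(missing)             # number of missing values per column
```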

Clean data

Subtask:

Handle missing values by either filling or dropping them.

Reasoning: Check for missing values in the DataFrame df to confirm the presence or absence of missing data as instructed.
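
Since the tips data turns out to have no missing values, the cleaning step is effectively a check. For a dataset that does have gaps, the two options named in the subtask could be sketched as follows; the rows are hypothetical, and filling with the column median is just one reasonable choice.

```python
import pandas as pd

# Hypothetical rows; the second tip is missing.
df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 15.0, 30.0],
    "tip": [1.5, None, 2.0, 4.5],
})

# Option 1: drop every row that contains a missing value.
dropped = df.dropna()

# Option 2: keep all rows and fill the gap with the column median.
filled = df.fillna({"tip": df["tip"].median()})

print(len(df), len(dropped), len(filled))
```

Dropping is simplest when few rows are affected; filling keeps the sample size but introduces values that were never observed.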

Basic data analysis

Subtask:

Calculate descriptive statistics and perform group-by operations to find insights.

Reasoning: Calculate descriptive statistics for numerical columns and perform group-by operations as requested in the instructions.
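
The three group-by summaries follow the same pattern, so they can also be produced in a loop. A sketch on hypothetical rows (the `summaries` name is an assumption for illustration):

```python
import pandas as pd

# Hypothetical rows with tips-style columns (not real data).
df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 15.0, 12.0],
    "tip": [1.5, 3.0, 2.0, 2.0],
    "sex": ["Female", "Male", "Male", "Female"],
    "smoker": ["No", "No", "Yes", "No"],
    "day": ["Sun", "Sun", "Sat", "Thur"],
})

# One table of group means per grouping column.
summaries = {
    col: df.groupby(col, observed=False)[["total_bill", "tip"]].mean()
    for col in ["day", "sex", "smoker"]
}

for col, table in summaries.items():
    print(f"\nMean total_bill and tip by {col}:")
    print(table)
```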

Visualize data

Subtask:

Create four different types of visualizations to explore the data.

Reasoning: Create the requested visualizations to explore the data distribution and relationships between variables.
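
The task also lists a line chart. The tips data has no time axis, so one possible stand-in is the mean tip plotted against party size. This is a sketch on hypothetical rows, and the choice of `size` for the x-axis is an assumption, not something the project specifies.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows (not real data).
df = pd.DataFrame({
    "size": [2, 2, 3, 4, 4],
    "tip": [1.5, 2.5, 3.0, 4.0, 5.0],
})

# Mean tip for each party size, sorted by size.
mean_tip_by_size = df.groupby("size")["tip"].mean()

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(mean_tip_by_size.index, mean_tip_by_size.values, marker="o")
ax.set_title("Average Tip by Party Size")
ax.set_xlabel("Party Size")
ax.set_ylabel("Average Tip")
```

In a notebook, drop the `matplotlib.use("Agg")` line and finish with `plt.show()`.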

Refine and interpret

Subtask:

Add titles and labels to plots and interpret the findings.

Reasoning: Review the previously generated plots and their code to ensure titles and labels are present and informative. Then, provide interpretations for each plot. Since the plots were generated in the previous step and included titles and labels, the main task is to provide the interpretations.

Summary:

Data Analysis Key Findings

  • The dataset contains 244 entries with no missing values across any column.
  • Numerical columns (total_bill, tip, size) show varying distributions, with total_bill being right-skewed.
  • Grouping by day reveals slightly higher average total bills and tips on Saturday and Sunday compared to Thursday and Friday.
  • On average, bills paid by male customers are higher, and carry higher tips, than bills paid by female customers.
  • Tables with smokers have slightly higher average total bills than non-smokers, but average tips are very similar between the two groups.
  • There is a positive correlation between the total bill and the tip amount.
  • Dinner bills tend to be higher and have a larger spread than lunch bills.
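
Two of the findings above, the right skew of total_bill and the positive correlation between total_bill and tip, can be checked numerically as well as visually. A sketch on hypothetical rows, using pandas' sample skewness and Pearson correlation:

```python
import pandas as pd

# Hypothetical rows with one large bill to create a right tail (not real data).
df = pd.DataFrame({
    "total_bill": [10.0, 12.0, 15.0, 20.0, 40.0],
    "tip": [1.5, 2.0, 2.0, 3.0, 6.0],
})

# Positive skewness indicates a longer right tail.
skewness = df["total_bill"].skew()

# Pearson correlation; values near +1 indicate a strong positive linear relationship.
correlation = df["total_bill"].corr(df["tip"])

print(f"skewness of total_bill: {skewness:.2f}")
print(f"correlation of total_bill and tip: {correlation:.2f}")
```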

Insights or Next Steps

  • The analysis suggests that weekend days (Saturday and Sunday) and the "Dinner" time slot are associated with higher spending and tipping. Further investigation into the factors driving these differences could be beneficial.
  • Given the positive correlation between total bill and tip, exploring the tip percentage rather than the absolute tip amount might provide additional insights into tipping behavior.
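
The tip-percentage idea could be sketched as below. The rows and the `tip_pct` column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows (not real data).
df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 40.0],
    "tip": [2.0, 3.0, 4.0],
    "time": ["Lunch", "Dinner", "Dinner"],
})

# Tip as a percentage of the bill, which removes the effect of bill size.
df["tip_pct"] = df["tip"] / df["total_bill"] * 100

# Compare tipping behavior across groups on the relative scale.
mean_pct_by_time = df.groupby("time")["tip_pct"].mean()
print(mean_pct_by_time)
```

A group can have higher absolute tips yet a lower tip percentage, which is why the two measures are worth comparing.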
