## Problem Statement

You work for a fitness company and have gathered data on the fitness activities of 50 individuals using fitness trackers. The dataset is stored in an Excel file named "fitness_data.xlsx" and includes the following columns:

- **name:** Name of the person.
- **steps_taken:** The number of steps taken by individuals.
- **calories_burned:** The estimated calories burned by individuals.
- **sleep_duration(hours):** The number of hours of sleep individuals got on that day.
- **water_intake(ounces):** The amount of water individuals consumed.


**Import Necessary Libraries**

In [1]:
import pandas as pd
import matplotlib.pyplot as plt


In [2]:
df = pd.read_excel("fitness_data.xlsx")


## Task1

1. Import the data from the "fitness_data.xlsx" Excel file.
2. Display the first few rows of the dataset to get an overview.
3. Calculate and display basic statistics (mean, median, min, max) for each column.


In [None]:
df.head()


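|   | name | steps_taken | calories_burned | sleep_duration(hours) | water_intake(ounces) |
|---|------|-------------|-----------------|-----------------------|----------------------|
| 0 | Akshay | 10500 | 4500 | 7.5 | 80 |
| 1 | Priya | 9800 | 4200 | 7.2 | 75 |
| 2 | Raj | 11500 | 4800 | 7.0 | 90 |
| 3 | Emily | 12000 | 5000 | 7.8 | 85 |
| 4 | Rohit | 8900 | 4000 | 7.0 | 70 |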


In [9]:
df.describe()

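|       | steps_taken | calories_burned | sleep_duration(hours) | water_intake(ounces) |
|-------|-------------|-----------------|-----------------------|----------------------|
| count | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 10316.0 | 4418.0 | 7.396 | 79.6 |
| std | 1177.052701 | 370.708092 | 1.660951 | 14.457538 |
| min | 8000.0 | 3700.0 | 4.0 | 30.0 |
| 25% | 9625.0 | 4200.0 | 7.0 | 70.0 |
| 50% | 10250.0 | 4400.0 | 7.2 | 80.0 |
| 75% | 11000.0 | 4700.0 | 7.5 | 90.0 |
| max | 15000.0 | 5500.0 | 18.0 | 100.0 |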
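
Task 1 asks specifically for the mean, median, min, and max, while `describe()` reports the median only as the `50%` row. The sketch below shows one way to request just those four statistics with `DataFrame.agg`. It runs on `sample`, a hypothetical stand-in built from the five rows shown by `df.head()`, so its numbers differ from the full 50-row dataset.

```python
import pandas as pd

# Hypothetical stand-in: the five rows shown by df.head(), not the full dataset
sample = pd.DataFrame({
    "name": ["Akshay", "Priya", "Raj", "Emily", "Rohit"],
    "steps_taken": [10500, 9800, 11500, 12000, 8900],
    "calories_burned": [4500, 4200, 4800, 5000, 4000],
    "sleep_duration(hours)": [7.5, 7.2, 7.0, 7.8, 7.0],
    "water_intake(ounces)": [80, 75, 90, 85, 70],
})

# Restrict to numeric columns, then compute only the requested statistics
summary = sample.select_dtypes(include="number").agg(["mean", "median", "min", "max"])
print(summary)
```

On the real data, the same call can be applied to `df` in place of `sample`.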


## Task2: Range and IQR

1. Calculate the range of "steps_taken".
2. Calculate the range of "calories_burned".
3. Calculate the Interquartile Range (IQR) for "sleep_duration(hours)".
4. Calculate the IQR for "water_intake(ounces)".
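
For reference, the two measures can be written out and checked by hand against the `min`, `max`, `25%`, and `75%` rows of the `describe()` output above. The symbols $x_{\max}$, $x_{\min}$, $Q_1$, and $Q_3$ are notation introduced here for illustration.

$$
\text{Range} = x_{\max} - x_{\min}, \qquad \text{IQR} = Q_3 - Q_1
$$

$$
\begin{aligned}
\text{Range}(\text{steps\_taken}) &= 15000 - 8000 = 7000 \\
\text{Range}(\text{calories\_burned}) &= 5500 - 3700 = 1800 \\
\text{IQR}(\text{sleep\_duration}) &= 7.5 - 7.0 = 0.5 \\
\text{IQR}(\text{water\_intake}) &= 90 - 70 = 20
\end{aligned}
$$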

In [18]:
# Calculate the range of "steps_taken" across all individuals
steps_range = df["steps_taken"].max() - df["steps_taken"].min()
print(f"steps taken range: {steps_range}")

# Calculate the range of "calories_burned" across all individuals
calories_burned_range = df["calories_burned"].max() - df["calories_burned"].min()
print(f"calories burned range: {calories_burned_range}")

# Calculate the Interquartile Range (IQR) for "sleep_duration(hours)"
quantile_sleep_1, quantile_sleep_3 = df["sleep_duration(hours)"].quantile([0.25, 0.75])
print(quantile_sleep_1, quantile_sleep_3)
iqr_sleep = quantile_sleep_3 - quantile_sleep_1  # IQR = Q3 - Q1
print(f"iqr sleep: {iqr_sleep}")

# Calculate the IQR for "water_intake(ounces)"
quantile_water_1, quantile_water_3 = df["water_intake(ounces)"].quantile([0.25, 0.75])
iqr_water = quantile_water_3 - quantile_water_1
print(f"iqr water: {iqr_water}")


steps taken range: 7000
calories burned range: 1800
7.0 7.5
iqr sleep: 0.5
iqr water: 20.0


In [16]:
df["sleep_duration(hours)"].value_counts()

sleep_duration(hours)
7.5     9
7.0     9
7.2     6
8.0     6
6.5     6
6.8     5
7.8     4
7.3     3
18.0    1
4.0     1
Name: count, dtype: int64

## Task3: Box Plot for Steps Taken

- Create a box plot for the "steps_taken" column to visualize the distribution of daily steps taken by individuals. Interpret the box plot and identify any outliers.

In [4]:


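# Set the figure size (the dimensions are an arbitrary choice)
plt.figure(figsize=(8, 6))

# Create a box plot for "Steps Taken"
plt.boxplot(df["steps_taken"])

# Set the title and labels
plt.title("Box Plot of Steps Taken")
plt.ylabel("Steps Taken")

# Rotate x-axis labels for better readability
plt.xticks([1], ["steps_taken"], rotation=45)

# Ensure proper layout and display the plot
plt.tight_layout()
plt.show()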


#### Observations

- The median daily step count across individuals is a little above 10,000 (10,250 according to `describe()`), as indicated by the orange line within the box.
- The point at 15,000 lies above the upper whisker, so it is flagged as an outlier: at least one individual took an exceptionally high number of steps. This could be due to various reasons, such as an unusually active day or a measurement error.
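
As a rough check on the outlier observation, the whisker limits can be computed by hand. This sketch assumes matplotlib's default whisker rule of 1.5 times the IQR and takes the quartiles of `steps_taken` directly from the `describe()` output (25% = 9625, 75% = 11000). The variable names are illustrative only.

```python
# Quartiles of steps_taken as reported by df.describe()
q1, q3 = 9625.0, 11000.0
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(f"fences: {lower_fence} to {upper_fence}")  # fences: 7562.5 to 13062.5

# The maximum (15000) exceeds the upper fence; the minimum (8000) is inside
max_is_outlier = 15000 > upper_fence
min_is_outlier = 8000 < lower_fence
print(max_is_outlier, min_is_outlier)  # True False
```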


## Task4: Outlier Detection with the IQR Method

- Use the IQR method to identify and label outliers in the "sleep_duration(hours)" column.

In [20]:
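# Define a function that returns the IQR-based lower and upper limits
def get_outliers(data):
    # Use the DataFrame passed in, not the global df
    Q1, Q3 = data["sleep_duration(hours)"].quantile([0.25, 0.75])
    IQR = Q3 - Q1
    low_range = Q1 - 1.5 * IQR
    high_range = Q3 + 1.5 * IQR

    return low_range, high_range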


In [21]:
# Get the lower and upper limits
low_range, high_range = get_outliers(df)
print(low_range, high_range)

6.25 8.25


In [25]:
# Keep only the rows whose sleep duration falls within the limits
df_no_outliers = df[(df["sleep_duration(hours)"] >= low_range) & (df["sleep_duration(hours)"] <= high_range)]


# Identify and display the outliers (rows outside the limits)
outliers = df[(df["sleep_duration(hours)"] < low_range) | (df["sleep_duration(hours)"] > high_range)]
print(outliers)

         name  steps_taken  calories_burned  sleep_duration(hours)  \
21      Elena        11700             4900                   18.0   
30  Siddharth        11300             4700                    4.0   

    water_intake(ounces)  
21                   100  
30                    75  
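
The task also asks for the outliers to be labeled, which the filtering above does not do. One possible approach is to add a boolean flag column, sketched below. The column name `sleep_outlier` and the four-row `sample` frame are assumptions made for illustration. The sleep values are taken from the outputs above and the limits are the 6.25 and 8.25 printed earlier.

```python
import pandas as pd

# Hypothetical stand-in containing two typical rows and the two flagged rows
sample = pd.DataFrame({
    "name": ["Akshay", "Elena", "Siddharth", "Priya"],
    "sleep_duration(hours)": [7.5, 18.0, 4.0, 7.2],
})

# Limits printed by get_outliers(df) above
low_range, high_range = 6.25, 8.25

# between() is inclusive on both ends by default; negate it to flag outliers
sample["sleep_outlier"] = ~sample["sleep_duration(hours)"].between(low_range, high_range)
print(sample)
```

Keeping the flag as a column leaves every row in place, so the outliers can still be excluded later with `sample[~sample["sleep_outlier"]]`.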
