## Lab Session 

### Learning Objectives:
- Working with data using Python libraries.
- Data Visualization.
- Exploratory data analysis and data preprocessing.
- Building a Linear regression model to predict the tip amount based on different input features.

### About the dataset (Customer Tip Data)

#### Dataset Source: https://www.kaggle.com/datasets/ranjeetjain3/seaborn-tips-dataset

The dataset contains information about 244 orders served at a restaurant in the United States. Each observation records factors related to the order, such as the total bill, the time, the number of people in the group, and the gender of the person paying.

#### Attribute Information:

- **total_bill:** Total bill (cost of the meal), including tax, in US dollars
- **tip:** Tip in US dollars
- **sex:** Sex of person paying for the meal
- **smoker:** Whether the group includes a smoker
- **day:** Day on which the order is served
- **time:** Time of the order
- **size:** Size of the group

Food servers’ tips in restaurants may be influenced by many factors, including the nature of the restaurant, the size of the party, and the location of the table in the restaurant. Restaurant managers need to know which factors matter when they assign tables to food servers. For the sake of staff morale, they usually want to avoid either the substance or the appearance of unfair treatment of the servers, for whom tips (at least in restaurants in the United States) are a major component of pay.

### Import required libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Load the dataset

In [None]:
import pandas as pd

file_path = 'path_to_your_dataset.csv'
tips_data = pd.read_csv(file_path)

print(tips_data.head())

### 1. Make a list of categorical and numerical columns in the data.

In [None]:
import pandas as pd


# Treat every non-numeric column as categorical, however pandas stores the strings
categorical_columns = tips_data.select_dtypes(exclude='number').columns.tolist()
numerical_columns = tips_data.select_dtypes(include='number').columns.tolist()

print("Categorical Columns:", categorical_columns)
print("\nNumerical Columns:", numerical_columns)

### 2. Compute the average bill amount for each day.

In [None]:
import pandas as pd

average_bill_per_day = tips_data.groupby('day')['total_bill'].mean()


print("Average Bill Amount for Each Day:")
print(average_bill_per_day)

### 3. Which gender is more generous in giving tips?

In [None]:
import pandas as pd



average_tip_per_gender = tips_data.groupby('sex')['tip'].mean()


print("Average Tip Amount for Each Gender:")
print(average_tip_per_gender)

### 4. According to the data, were there more customers for dinner or lunch?

In [None]:
import pandas as pd


customer_counts = tips_data['time'].value_counts()

print("Number of Customers for Dinner and Lunch:")
print(customer_counts)

### 5. Based on the statistical summary, comment on the variable 'tip'

In [None]:
import pandas as pd

tip_summary = tips_data['tip'].describe()

print("Statistical Summary for 'tip' Variable:")
print(tip_summary)

### 6. Find the busiest day in terms of the orders?

In [None]:
import pandas as pd

busiest_day = tips_data['day'].value_counts().idxmax()

print("Busiest Day in Terms of Orders:", busiest_day)

### 7. Is the variable 'total_bill' skewed? If yes, identify the type of skewness. Support your answer with a plot

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

skewness = tips_data['total_bill'].skew()

plt.figure(figsize=(8, 6))
sns.histplot(tips_data['total_bill'], kde=True)
plt.title('Distribution of Total Bill Amounts')
plt.xlabel('Total Bill')
plt.ylabel('Frequency')
plt.show()

print("Skewness of 'total_bill':", skewness)
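
The sign of the skewness value indicates the type of skew: a positive value means a longer right tail, a negative value a longer left tail. The sketch below turns that rule into a small helper. The name `skew_type`, the `tolerance` of 0.5 (a common rule of thumb, not a fixed standard), and the sample values are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd


def skew_type(values: pd.Series, tolerance: float = 0.5) -> str:
    """Classify the direction of skew from the sample skewness."""
    skewness = values.skew()
    if skewness > tolerance:
        return "right-skewed (positive)"
    if skewness < -tolerance:
        return "left-skewed (negative)"
    return "approximately symmetric"


# Made-up bill amounts with one unusually large bill (hypothetical data)
sample_bills = pd.Series([10, 11, 12, 12, 13, 14, 15, 48])
print(skew_type(sample_bills))
```

With the real data, the same helper could be called as `skew_type(tips_data['total_bill'])` and compared against the shape of the histogram.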

### 8. Is the tip amount dependent on the total bill? Visualize the relationship with an appropriate plot and metric, and write your findings.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


plt.figure(figsize=(8, 6))
sns.scatterplot(x='total_bill', y='tip', data=tips_data)
plt.title('Relationship Between Total Bill and Tip Amount')
plt.xlabel('Total Bill')
plt.ylabel('Tip Amount')
plt.show()

correlation_coefficient = tips_data['total_bill'].corr(tips_data['tip'])

print("Correlation Coefficient between Total Bill and Tip Amount:", correlation_coefficient)

### 9. What is the percentage of males and females in the dataset? Display it in a plot.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

gender_percentage = tips_data['sex'].value_counts(normalize=True) * 100

print("Percentage of Males and Females:")
print(gender_percentage)

plt.figure(figsize=(6, 6))
plt.pie(gender_percentage, labels=gender_percentage.index, autopct='%1.1f%%', startangle=90, colors=['lightblue', 'lightcoral'])
plt.title('Percentage of Males and Females in the Dataset')
plt.show()

### 10. Compute the gender-wise count based on smoking habits and display it in a plot.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

gender_smoking_count = pd.crosstab(tips_data['sex'], tips_data['smoker'])

print("Gender-wise Count Based on Smoking Habits:")
print(gender_smoking_count)

gender_smoking_count.plot(kind='bar', stacked=True, color=['lightblue', 'lightcoral'])
plt.title('Gender-wise Count Based on Smoking Habits')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()

### 11. Compute the average tip amount given on different days and display it in a plot.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


average_tip_per_day = tips_data.groupby('day')['tip'].mean().reset_index()

print("Average Tip Amount for Different Days:")
print(average_tip_per_day)

plt.figure(figsize=(8, 6))
sns.barplot(x='day', y='tip', data=average_tip_per_day, palette='viridis')
plt.title('Average Tip Amount for Different Days')
plt.xlabel('Day')
plt.ylabel('Average Tip Amount')
plt.show()

### 12. Is the average bill amount dependent on the size of the group? Visualize the relationship using an appropriate plot and write your findings.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

average_bill_per_size = tips_data.groupby('size')['total_bill'].mean().reset_index()

plt.figure(figsize=(8, 6))
sns.scatterplot(x='size', y='total_bill', data=average_bill_per_size)
plt.title('Relationship Between Average Bill Amount and Group Size')
plt.xlabel('Group Size')
plt.ylabel('Average Bill Amount')
plt.show()

correlation_coefficient = average_bill_per_size['size'].corr(average_bill_per_size['total_bill'])

print("Correlation Coefficient between Group Size and Average Bill Amount:", correlation_coefficient)

### 13. Plot a horizontal boxplot to compare the bill amount based on gender

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


plt.figure(figsize=(10, 6))
sns.boxplot(x='total_bill', y='sex', data=tips_data, orient='h', palette='Set2')
plt.title('Comparison of Bill Amount Based on Gender')
plt.xlabel('Total Bill Amount')
plt.ylabel('Gender')
plt.show()

### 14. Find the maximum bill amount for lunch and dinner on Saturday and Sunday

In [None]:
import pandas as pd


weekend_meals = tips_data[(tips_data['day'].isin(['Sat', 'Sun'])) & (tips_data['time'].isin(['Lunch', 'Dinner']))]

max_bill_amounts = weekend_meals.groupby(['day', 'time'])['total_bill'].max()

print("Maximum Bill Amount for Lunch and Dinner on Saturday and Sunday:")
print(max_bill_amounts)

### 15. Compute the percentage of missing values in the dataset.

In [None]:
import pandas as pd

missing_percentage = (tips_data.isnull().sum() / len(tips_data)) * 100

print("Percentage of Missing Values in the Dataset:")
print(missing_percentage)
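
The cell above reports the missing percentage per column. If a single figure for the whole dataset is wanted, one possible approach is to divide the total number of missing cells by the total number of cells. The sketch below uses a small made-up frame (the values and the name `sample` are hypothetical) so that the arithmetic is easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 2 missing cells out of 8
sample = pd.DataFrame({
    "total_bill": [16.99, np.nan, 21.01, 23.68],
    "tip": [1.01, 1.66, np.nan, 3.31],
})

# Share of missing values in each column
per_column = sample.isnull().mean() * 100

# Share of missing values across every cell of the frame
overall = sample.isnull().sum().sum() / sample.size * 100

print(per_column)
print(f"Overall missing: {overall:.1f}%")
```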

### 16. Are there any duplicate records in the dataset? If yes, compute the count of the duplicate records and drop them.

In [None]:
import pandas as pd

duplicate_count = tips_data.duplicated().sum()


print("Count of Duplicate Records:", duplicate_count)

if duplicate_count > 0:
    tips_data = tips_data.drop_duplicates()
    print("Duplicate records dropped.")
else:
    print("No duplicate records found.")

### 17. Are there any outliers present in the column 'total_bill'? If yes, treat them with a transformation approach, and plot a boxplot before and after the treatment.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(12, 6))
sns.boxplot(x='total_bill', data=tips_data)
plt.title('Boxplot of Total Bill Amount (Before Treatment)')
plt.xlabel('Total Bill Amount')
plt.show()

tips_data['total_bill_log'] = np.log1p(tips_data['total_bill'])

plt.figure(figsize=(12, 6))
sns.boxplot(x='total_bill_log', data=tips_data)
plt.title('Boxplot of Log-Transformed Total Bill Amount (After Treatment)')
plt.xlabel('Log-Transformed Total Bill Amount')
plt.show()

The two boxplots show the 'total_bill' column before and after a log transformation (`np.log1p`). The log transformation is only one possible approach; other transformations, such as a square root, may suit the data better depending on its distribution.

### 18. Are there any outliers present in the column 'tip'? If yes, remove them using the IQR technique.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
sns.boxplot(x='tip', data=tips_data)
plt.title('Boxplot of Tip Amount (Before Outlier Removal)')
plt.xlabel('Tip Amount')
plt.show()

Q1 = tips_data['tip'].quantile(0.25)
Q3 = tips_data['tip'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

tips_data_no_outliers = tips_data[(tips_data['tip'] >= lower_bound) & (tips_data['tip'] <= upper_bound)]

plt.figure(figsize=(8, 6))
sns.boxplot(x='tip', data=tips_data_no_outliers)
plt.title('Boxplot of Tip Amount (After Outlier Removal)')
plt.xlabel('Tip Amount')
plt.show()
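
The IQR rule used above keeps only the values within `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR`. As a rough sketch, the same steps can be wrapped in a reusable function. The name `remove_outliers_iqr`, the parameter `k`, and the sample tips are assumptions made for illustration.

```python
import pandas as pd


def remove_outliers_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return the rows whose `column` value lies within k * IQR of the quartiles."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends by default
    return df[df[column].between(lower, upper)]


# Made-up tips with one extreme value (hypothetical data)
sample = pd.DataFrame({"tip": [1, 2, 2, 3, 3, 3, 4, 4, 10]})
cleaned = remove_outliers_iqr(sample, "tip")
print(len(sample), "->", len(cleaned))
```

Here Q1 is 2 and Q3 is 4, so the bounds are -1 and 7 and only the tip of 10 is dropped. On the real data the call would be `remove_outliers_iqr(tips_data, 'tip')`.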

### 19. Encode the categorical columns in the dataset and print 5 random samples from the dataframe.

In [None]:
import pandas as pd

categorical_columns = ['sex', 'smoker', 'day', 'time']

tips_data_encoded = pd.get_dummies(tips_data, columns=categorical_columns)

print("Random 5 Samples from the Encoded DataFrame:")
print(tips_data_encoded.sample(5))
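
`pd.get_dummies` creates one indicator column per category by default. When the encoded data feeds a linear regression, a common variation is to pass `drop_first=True`, which drops one indicator per variable so the remaining columns are not perfectly collinear. The sketch below shows the effect on a tiny made-up frame (the values are hypothetical).

```python
import pandas as pd

# Hypothetical rows, only to show the shape of the encoded output
sample = pd.DataFrame({
    "tip": [1.01, 1.66, 3.50],
    "sex": ["Female", "Male", "Male"],
})

# drop_first=True removes the first category ("Female"), leaving a single indicator
encoded = pd.get_dummies(sample, columns=["sex"], drop_first=True)
print(encoded.columns.tolist())
```

Whether to drop a level is a modelling choice; keeping every indicator is fine for inspection and for models that do not rely on linearly independent inputs.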

### 20. Check the range of the column 'total_bill' and transform the values such that the range will be 1.

In [None]:
import pandas as pd

current_range = tips_data['total_bill'].max() - tips_data['total_bill'].min()

desired_range = 1

scaling_factor = desired_range / current_range

tips_data['total_bill_scaled'] = tips_data['total_bill'] * scaling_factor

print("Original 'total_bill' column:")
print(tips_data['total_bill'].head())

print("\nTransformed 'total_bill_scaled' column with a range of 1:")
print(tips_data['total_bill_scaled'].head())
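
Multiplying by `1 / range` gives the column a range of 1 but leaves its minimum above 0. Min-max scaling also subtracts the minimum, so the values land exactly in [0, 1]. The comparison below is a small sketch on made-up bill amounts (hypothetical values, not from the dataset).

```python
import pandas as pd

# Made-up bill amounts (hypothetical data)
bills = pd.Series([10.0, 20.0, 50.0])
value_range = bills.max() - bills.min()

# Scaling only: the range becomes 1, but the minimum is not 0
scaled_only = bills / value_range

# Min-max scaling: shift by the minimum first, so values span 0 to 1
min_max = (bills - bills.min()) / value_range

print("Scaled only:", scaled_only.tolist(), "range =", scaled_only.max() - scaled_only.min())
print("Min-max:    ", min_max.tolist(), "range =", min_max.max() - min_max.min())
```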


### 21. Load the dataset again by giving the name of the dataframe as "tips_df"
- i) Encode the categorical variables.
- ii) Store the target column (i.e., tip) in the y variable and the rest of the columns in the X variable.

In [None]:
import pandas as pd

tips_df = pd.read_csv('path_to_your_dataset.csv') 

categorical_columns = ['sex', 'smoker', 'day', 'time']
tips_df_encoded = pd.get_dummies(tips_df, columns=categorical_columns)

X = tips_df_encoded.drop('tip', axis=1)  
y = tips_df_encoded['tip']  

print("Encoded DataFrame:")
print(tips_df_encoded.head())

print("\nX Shape:", X.shape)
print("y Shape:", y.shape)

### 22. Split the dataset into two parts (i.e. 70% train and 30% test), and scale the columns "total_bill" and "size" using the min-max scaling approach

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

tips_df = pd.read_csv('path_to_your_dataset.csv') 

X = tips_df[['total_bill', 'size']]
y = tips_df['tip']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

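scaler = MinMaxScaler()

# Fit the scaler on the training split only, then apply the same scaling to the test split.
# Rebuilding the DataFrames avoids assigning into slices returned by train_test_split.
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)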

print("Scaled Training Data:")
print(X_train.head())

print("\nScaled Test Data:")
print(X_test.head())


### 23. Train a linear regression model using the training data and print the r_squared value of the prediction on the test data.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

tips_df = pd.read_csv('path_to_your_dataset.csv')  
X = tips_df[['total_bill', 'size']]
y = tips_df['tip']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

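scaler = MinMaxScaler()

# Fit the scaler on the training split only, then apply the same scaling to the test split.
# Rebuilding the DataFrames avoids assigning into slices returned by train_test_split.
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)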

model = LinearRegression()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

r_squared = r2_score(y_test, y_pred)
print(f'R-squared Value on Test Data: {r_squared}')
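
The R-squared value compares the model's squared errors with those of a baseline that always predicts the mean: `R² = 1 - SS_res / SS_tot`. As a minimal sketch of that definition, the function below computes it with NumPy. The name `r_squared_manual` and the sample numbers are assumptions for illustration and are not results from the tips data.

```python
import numpy as np


def r_squared_manual(y_true, y_pred) -> float:
    """Compute R-squared as 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean baseline
    return 1.0 - ss_res / ss_tot


# Made-up actual and predicted tips (hypothetical data)
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared_manual(actual, predicted), 3))
```

A value of 1 means perfect predictions, 0 means the model does no better than predicting the mean, and negative values are possible on test data when the model does worse than that baseline.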


### Happy Learning:)