## Lab Session 

### Learning Objective:
- Working with data using Python libraries.
- Data Visualization.
- Exploratory data analysis and data preprocessing.
- Building a Linear regression model to predict the tip amount based on different input features.

### About the dataset (Customer Tip Data)

#### Dataset Source: https://www.kaggle.com/datasets/ranjeetjain3/seaborn-tips-dataset

The dataset contains information about the 244 orders served at a restaurant in the United States. Each observation records factors related to the order, such as the total bill, the time of the meal, the number of people in the group, and the sex of the person paying.

#### Attribute Information:

- **total_bill:** Total bill (cost of the meal), including tax, in US dollars
- **tip:** Tip in US dollars
- **sex:** Sex of person paying for the meal
- **smoker:** There is a smoker in a group or not
- **day:** Day on which the order is served
- **time:** Time of the order
- **size:** Size of the group

Food servers’ tips in restaurants may be influenced by many factors, including the nature of the restaurant, the size of the party, and the table locations in the restaurant. Restaurant managers need to know which factors matter when they assign tables to food servers. For the sake of staff morale, they usually want to avoid either the substance or the appearance of unfair treatment of the servers, for whom tips (at least in restaurants in the United States) are a major component of pay.

### Import required libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt  
import seaborn as sns  


### Load the dataset

In [2]:
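# Assumption: tips.csv has been downloaded from the Kaggle page above into the working directory.
# The Kaggle dataset page is not a direct CSV link, so pd.read_csv is pointed at the local file.
data = pd.read_csv("tips.csv")
print(data.head())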


### 1. Make a list of categorical and numerical columns in the data.

In [3]:
data = {
    'total_bill': [21.5, 17.7, 30.0, 23.1, 26.0, 14.1, 16.9, 29.0, 10.3, 25.3],
    'tip': [3.50, 3.00, 5.00, 2.00, 4.00, 1.60, 2.70, 5.00, 1.00, 4.70],
    'sex': ['Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Female'],
    'smoker': ['No', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes'],
    'day': ['Sun', 'Sat', 'Fri', 'Sun', 'Thur', 'Fri', 'Sat', 'Fri', 'Thur', 'Fri'],
    'time': ['Dinner', 'Lunch', 'Dinner', 'Dinner', 'Dinner', 'Lunch', 'Dinner', 'Dinner', 'Lunch', 'Dinner'],
    'size': [2, 2, 4, 3, 3, 2, 2, 4, 2, 3]
}

df = pd.DataFrame(data)
categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
numerical_columns = [col for col in df.columns if col not in categorical_columns]

print("Categorical Columns:", categorical_columns)
print("Numerical Columns:", numerical_columns)

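As an alternative sketch, `select_dtypes` can split the columns without comparing dtypes by hand, so it does not rely on text columns being stored with the `object` dtype. The small `sample` frame below is illustrative only and reuses the first three rows of the sample data above.

```python
import pandas as pd

# Illustrative frame: first three rows of the sample data used in this lab.
sample = pd.DataFrame({
    "total_bill": [21.5, 17.7, 30.0],
    "tip": [3.50, 3.00, 5.00],
    "sex": ["Male", "Female", "Male"],
    "size": [2, 2, 4],
})

# Numeric dtypes (int, float) go to one list, everything else to the other.
numerical = sample.select_dtypes(include="number").columns.tolist()
categorical = sample.select_dtypes(exclude="number").columns.tolist()

print("Categorical Columns:", categorical)
print("Numerical Columns:", numerical)
```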

### 2. Compute the average bill amount for each day.

In [None]:
avg_bill_per_day = df.groupby('day')['total_bill'].mean()
print(avg_bill_per_day)

### 3. Which gender is more generous in giving tips?

In [None]:
avg_tip_per_gender = df.groupby('sex')['tip'].mean()
most_generous_gender = avg_tip_per_gender.idxmax()
print(f"The gender with the higher average tip amount is: {most_generous_gender}")

### 4. According to the data, were there more customers for dinner or lunch?

In [None]:
n_lunch = df[df['time'] == 'Lunch'].shape[0]
n_dinner = df[df['time'] == 'Dinner'].shape[0]
more_customers = 'Dinner' if n_dinner > n_lunch else 'Lunch'
print(f"According to the data, there were more customers for {more_customers}.")

### 5. Based on the statistical summary, comment on the variable 'tip'

In [None]:
tip_stats = df['tip'].describe()
comment_data_type = f"The 'tip' variable is of data type {df['tip'].dtype}.\n"

tip_skew = df['tip'].skew()  # describe() does not include skewness, so compute it separately
if tip_skew > 0:
    comment_central_tendency = f"The median tip amount is {tip_stats['50%']:.2f} dollars. Since the data is skewed to the right (a long tail of high values), the median represents the 'center' of the tip amounts better than the mean.\n"
else:
    comment_central_tendency = f"The average tip amount is {tip_stats['mean']:.2f} dollars. This indicates that on average, customers tip {tip_stats['mean']:.2f} dollars per bill.\n"

comment_spread = f"The standard deviation of the tip amount is {tip_stats['std']:.2f} dollars. This suggests there is a variability in tip amounts, with some customers tipping significantly more or less than the average.\n"

In [None]:
iqr = tip_stats['75%'] - tip_stats['25%']
lower_bound = tip_stats['25%'] - (1.5 * iqr)
upper_bound = tip_stats['75%'] + (1.5 * iqr)
outliers_count = df[(df['tip'] < lower_bound) | (df['tip'] > upper_bound)].shape[0]
comment_outliers = (
    f"There might be {outliers_count} outliers in the tip data based on the Interquartile Range (IQR). These are data points that fall outside the range of [{lower_bound:.2f}, {upper_bound:.2f}] dollars.\n"
    if outliers_count > 0
    else ""
)

In [None]:
all_comments = comment_data_type + comment_central_tendency + comment_spread + comment_outliers

In [None]:
print("Comments on the 'tip' variable based on statistical summary:")
print(all_comments)
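The summary returned by `describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum, but not skewness. The sketch below, which reuses the ten tip values from the sample data, shows the available keys and computes skewness separately with `Series.skew()`. Comparing the mean with the median gives a rough second check on the direction of skew.

```python
import pandas as pd

# Tip values copied from the ten-row sample used in this lab.
tip = pd.Series([3.50, 3.00, 5.00, 2.00, 4.00, 1.60, 2.70, 5.00, 1.00, 4.70], name="tip")

summary = tip.describe()
print(list(summary.index))  # the statistics that describe() provides

tip_skew = tip.skew()  # sample skewness, not part of describe()
mean_minus_median = summary["mean"] - summary["50%"]
print(f"skewness = {tip_skew:.3f}, mean - median = {mean_minus_median:.3f}")
```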

### 6. Find the busiest day in terms of orders.

In [None]:
n_orders_per_day = df['day'].value_counts()
busiest_day = n_orders_per_day.idxmax()
print(f"The busiest day in terms of orders is: {busiest_day}")

### 7. Is the variable 'total_bill' skewed? If yes, identify the type of skewness. Support your answer with a plot

In [None]:
bill_skewness = df['total_bill'].skew()

plt.hist(df['total_bill'])
plt.xlabel('Total Bill Amount (USD)')
plt.ylabel('Number of Orders')
plt.title('Distribution of Total Bill Amounts')
plt.show()

skew_type = "no skewness" 
if bill_skewness > 0:
    skew_type = "positive skew (right-skewed)"
elif bill_skewness < 0:
    skew_type = "negative skew (left-skewed)"

print(f"The skewness of the 'total_bill' variable is {bill_skewness:.2f}.")
print(f"The distribution of 'total_bill' exhibits {skew_type}.")

### 8. Is the tip amount dependent on the total bill? Visualize the relationship with an appropriate plot and metric and write your findings.

In [None]:
correlation = df['tip'].corr(df['total_bill'])

plt.scatter(df['total_bill'], df['tip'])
plt.xlabel('Total Bill Amount (USD)')
plt.ylabel('Tip Amount (USD)')
plt.title('Relationship Between Total Bill and Tip Amount')
plt.show()

print(f"The correlation coefficient between 'tip' and 'total_bill' is {correlation:.2f}.")

# Classify the strength from the magnitude and the direction from the sign.
if abs(correlation) > 0.7:
    strength = "strong"
elif abs(correlation) > 0.5:
    strength = "moderate"
elif abs(correlation) > 0.3:
    strength = "weak"
else:
    strength = "very weak or negligible"

direction = "positive" if correlation > 0 else "negative"
finding_text = f"There is a {strength} {direction} correlation between tip amount and total bill amount."

print(finding_text)
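To make the metric concrete, the sketch below computes the Pearson correlation coefficient from its definition (the sum of products of deviations divided by the square root of the product of the summed squared deviations) and compares it with `Series.corr`, which uses Pearson by default. The values are the ten sample rows used in this lab, so the result describes the sample rather than the full dataset.

```python
import numpy as np
import pandas as pd

# Ten sample rows from this lab.
total_bill = pd.Series([21.5, 17.7, 30.0, 23.1, 26.0, 14.1, 16.9, 29.0, 10.3, 25.3])
tip = pd.Series([3.50, 3.00, 5.00, 2.00, 4.00, 1.60, 2.70, 5.00, 1.00, 4.70])

# Deviations of each variable from its own mean.
dx = total_bill - total_bill.mean()
dy = tip - tip.mean()

# Pearson r: sum of products of deviations over the product of their magnitudes.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_pandas = tip.corr(total_bill)

print(f"manual r = {r_manual:.4f}, pandas r = {r_pandas:.4f}")
```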

### 9. What is the percentage of males and females in the dataset? Display it in a plot.

In [None]:
n_male = df[df['sex'] == 'Male'].shape[0]
n_female = df[df['sex'] == 'Female'].shape[0]

pct_male = (n_male / (n_male + n_female)) * 100
pct_female = (n_female / (n_male + n_female)) * 100

labels = ['Male', 'Female']
sizes = [pct_male, pct_female]
explode = (0.1, 0)  

plt.pie(sizes, explode=explode, labels=labels, autopct="%1.1f%%", shadow=True, startangle=140)
plt.title('Gender Distribution')
plt.axis('equal')  
plt.show()

print(f"The percentage of males in the dataset is: {pct_male:.1f}%")
print(f"The percentage of females in the dataset is: {pct_female:.1f}%")

### 10. Compute the gender-wise count based on smoking habits and display it in the plot

In [None]:
gender_smoker_counts = df.groupby(['sex', 'smoker']).size().unstack()

gender_smoker_counts.plot(kind='bar', stacked=True, color=['skyblue', 'lightgreen'])
plt.xlabel('Gender')
plt.ylabel('Count')
plt.title('Gender-wise Count Based on Smoking Habits')
plt.xticks(rotation=0) 
plt.legend(title='Smoker')  
plt.subplots_adjust(bottom=0.3)  
plt.show()

### 11. Compute the average tip amount given for different days and display it in the plot.

In [None]:
avg_tip_per_day = df.groupby('day')['tip'].mean()

plt.bar(avg_tip_per_day.index, avg_tip_per_day.values)
plt.xlabel('Day')
plt.ylabel('Average Tip Amount (USD)')
plt.title('Average Tip Amount by Day')
plt.xticks(rotation=0)  
plt.show()

### 12. Is the average bill amount dependent on the size of the group? Visualize the relationship using an appropriate plot and write your findings.

In [None]:
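# violinplot has no showmeans option; inner="box" draws a small boxplot inside each violin.
# Plot from the DataFrame df rather than the raw data dictionary.
sns.violinplot(
    x="size",
    y="total_bill",
    inner="box",
    data=df
)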

plt.xticks(rotation=45)
plt.show()
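A numeric summary can back up the written findings. As a minimal sketch on the ten sample rows from this lab, the mean bill per group size can be computed with `groupby`. On this small sample the mean bill rises with group size; the full dataset should be checked before drawing a general conclusion.

```python
import pandas as pd

# Bill amounts and group sizes from the ten-row sample used in this lab.
orders = pd.DataFrame({
    "total_bill": [21.5, 17.7, 30.0, 23.1, 26.0, 14.1, 16.9, 29.0, 10.3, 25.3],
    "size": [2, 2, 4, 3, 3, 2, 2, 4, 2, 3],
})

# Mean bill for each group size, ordered by size.
avg_bill_per_size = orders.groupby("size")["total_bill"].mean()
print(avg_bill_per_size)
```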

### 13. Plot a horizontal boxplot to compare the bill amount based on gender

In [None]:
# Horizontal boxplot: the numeric variable goes on x and the category on y.
sns.boxplot(
    x="total_bill",
    y="sex",
    showmeans=True,
    data=df,
    orient="h",
    palette="Set3"
)

plt.xticks(rotation=45)
plt.show()

### 14. Find the maximum bill amount for lunch and dinner on Saturday and Sunday

In [None]:
weekend_lunch_dinner = df[(df['day'].isin(['Sun', 'Sat'])) & (df['time'].isin(['Lunch', 'Dinner']))]

max_bill_lunch = weekend_lunch_dinner[weekend_lunch_dinner['time'] == 'Lunch']['total_bill'].max()
max_bill_dinner = weekend_lunch_dinner[weekend_lunch_dinner['time'] == 'Dinner']['total_bill'].max()

print("Maximum bill amount for Lunch on Saturday and Sunday:", max_bill_lunch)
print("Maximum bill amount for Dinner on Saturday and Sunday:", max_bill_dinner)
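The same answer can be obtained in one step by grouping on both day and time. The sketch below uses the ten sample rows from this lab; combinations with no orders (for example Sunday lunch in this sample) simply do not appear in the result.

```python
import pandas as pd

# Day, time, and bill columns from the ten-row sample used in this lab.
orders = pd.DataFrame({
    "total_bill": [21.5, 17.7, 30.0, 23.1, 26.0, 14.1, 16.9, 29.0, 10.3, 25.3],
    "day": ["Sun", "Sat", "Fri", "Sun", "Thur", "Fri", "Sat", "Fri", "Thur", "Fri"],
    "time": ["Dinner", "Lunch", "Dinner", "Dinner", "Dinner", "Lunch", "Dinner", "Dinner", "Lunch", "Dinner"],
})

# Keep weekend rows, then take the maximum bill per (day, time) pair.
weekend = orders[orders["day"].isin(["Sat", "Sun"])]
max_bill = weekend.groupby(["day", "time"])["total_bill"].max()
print(max_bill)
```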

### 15. Compute the percentage of missing values in the dataset.

In [None]:
missing_values = (df.isna().sum() / len(df)) * 100

print("Percentage of missing values in the dataset:")
print(missing_values.to_string())

### 16. Are there any duplicate records in the dataset? If yes, compute the count of the duplicate records and drop them.

In [None]:
duplicates = df.duplicated()

if duplicates.any():
  num_duplicates = duplicates.sum()
  print(f"There are {num_duplicates} duplicate records in the dataset.")
  
  df = df.drop_duplicates()
  print(f"Duplicates have been dropped. The dataframe now has {len(df)} rows.")
else:
  print("There are no duplicate records in the dataset.")

### 17. Are there any outliers present in the column 'total_bill'? If yes, treat them with a transformation approach, and plot a boxplot before and after the treatment.

In [None]:
Q1 = df['total_bill'].quantile(0.25)
Q3 = df['total_bill'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['total_bill'] < lower_bound) | (df['total_bill'] > upper_bound)]

print(f"Number of outliers in 'total_bill' before treatment: {len(outliers)}")

df_outlier_treated = df.copy()

df_outlier_treated['total_bill'] = df_outlier_treated['total_bill'].clip(lower_bound, upper_bound)

outliers_treated = df_outlier_treated[(df_outlier_treated['total_bill'] < lower_bound) | (df_outlier_treated['total_bill'] > upper_bound)]
print(f"Number of outliers in 'total_bill' after treatment: {len(outliers_treated)}")

sns.boxplot(
    x = "total_bill",
    showmeans=True,
    data=df
)
plt.title("Total Bill Distribution (Before Outlier Treatment)")
plt.show()

sns.boxplot(
    x = "total_bill",
    showmeans=True,
    data=df_outlier_treated
)
plt.title("Total Bill Distribution (After Outlier Treatment)")
plt.show()
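The cell above caps values at the IQR bounds. Another common reading of "transformation approach" is a log transform, which compresses large values and so pulls in a long right tail. The sketch below is one possible version under that assumption, applied to the ten sample bill amounts; `np.log1p` computes log(1 + x) and `np.expm1` inverts it.

```python
import numpy as np
import pandas as pd

# Bill amounts from the ten-row sample used in this lab.
total_bill = pd.Series(
    [21.5, 17.7, 30.0, 23.1, 26.0, 14.1, 16.9, 29.0, 10.3, 25.3],
    name="total_bill",
)

# log(1 + x) keeps the ordering of values but shrinks the gaps between large ones.
log_bill = np.log1p(total_bill)

print(f"skewness before: {total_bill.skew():.3f}")
print(f"skewness after:  {log_bill.skew():.3f}")

# The transform is reversible, so predictions can be mapped back to dollars.
recovered = np.expm1(log_bill)
```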

### 18. Are there any outliers present in the column 'tip'? If yes, remove them using the IQR technique.

In [None]:
Q1 = df['tip'].quantile(0.25)
Q3 = df['tip'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR


outliers = df[(df['tip'] < lower_bound) | (df['tip'] > upper_bound)]

if len(outliers) > 0:
  print(f"Outliers detected in 'tip':")
  print(outliers)
else:
  print("No outliers detected in 'tip'")


# Keep only the rows whose tip lies within the IQR bounds.
df_filtered = df[df['tip'].between(lower_bound, upper_bound)]

sns.boxplot(
    x = "tip",
    showmeans=True,
    data=df
)

plt.title("Tip Distribution (Before Outlier Treatment)")
plt.show()

sns.boxplot(
    x = "tip",
    showmeans=True,
    data=df_filtered
)

plt.title("Tip Distribution (After Outlier Treatment)")
plt.show()

### 19. Encode the categorical columns in the dataset and print the random 5 samples from the dataframe.

In [None]:
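from sklearn.preprocessing import OneHotEncoder

categorical_cols = df.select_dtypes(include=['object']).columns

# sparse_output replaced `sparse` in scikit-learn 1.2; named columns and a shared index keep the concat aligned.
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = pd.DataFrame(encoder.fit_transform(df[categorical_cols]), columns=encoder.get_feature_names_out(categorical_cols), index=df.index)
encoded_df = pd.concat([df.drop(categorical_cols, axis=1), encoded], axis=1)

print(encoded_df.sample(5))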

### 20. Check the range of the column 'total_bill' and transform the values such that the range will be 1.

In [None]:
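bill_range = df['total_bill'].max() - df['total_bill'].min()
print(f"Range of 'total_bill' before scaling: {bill_range:.2f}")

def scale_to_range(x, old_min, old_max, new_min, new_max):
  return ((x - old_min) * (new_max - new_min)) / (old_max - old_min) + new_min

# Compute min and max once and scale the whole column; mapping to [0, 1] makes the range exactly 1.
bill_min, bill_max = df['total_bill'].min(), df['total_bill'].max()
df['scaled_bill'] = scale_to_range(df['total_bill'], bill_min, bill_max, 0, 1)

print(df)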
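A quick check confirms that the scaled column has the required range. The sketch below applies the same min-max formula with a target interval of [0, 1] to the ten sample bill amounts and verifies that the minimum maps to 0, the maximum maps to 1, and the range is therefore 1.

```python
import pandas as pd

# Bill amounts from the ten-row sample used in this lab.
total_bill = pd.Series([21.5, 17.7, 30.0, 23.1, 26.0, 14.1, 16.9, 29.0, 10.3, 25.3])

# Min-max scaling to [0, 1]: subtract the minimum, divide by the original range.
bill_min, bill_max = total_bill.min(), total_bill.max()
scaled = (total_bill - bill_min) / (bill_max - bill_min)

scaled_range = scaled.max() - scaled.min()
print(f"min = {scaled.min()}, max = {scaled.max()}, range = {scaled_range}")
```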

### 21. Load the dataset again by giving the name of the dataframe as "tips_df"
- i) Encode the categorical variables.
- ii) Store the target column (i.e., tip) in the y variable and the rest of the columns in the X variable.

In [None]:
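from sklearn.preprocessing import OneHotEncoder

# Assumption: rebuild the frame from the sample `data` dictionary defined in question 1.
# For the full dataset, read the downloaded CSV file here instead.
tips_df = pd.DataFrame(data)
categorical_cols = tips_df.select_dtypes(include=['object']).columns

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = pd.DataFrame(encoder.fit_transform(tips_df[categorical_cols]), columns=encoder.get_feature_names_out(categorical_cols), index=tips_df.index)
encoded_df = pd.concat([tips_df.drop(categorical_cols, axis=1), encoded], axis=1)

y = encoded_df['tip']

X = encoded_df.drop('tip', axis=1)

print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")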

### 22. Split the dataset into two parts (i.e., 70% train and 30% test), and scale the columns "total_bill" and "size" using the min-max scaling approach

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Split the encoded features X and target y from the previous step; the raw tips_df still holds string columns the model cannot use.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = MinMaxScaler()

columns_to_scale = ['total_bill', 'size']

scaler.fit(X_train[columns_to_scale])

X_train[columns_to_scale] = scaler.transform(X_train[columns_to_scale])
X_test[columns_to_scale] = scaler.transform(X_test[columns_to_scale])

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

### 23. Train a linear regression model using the training data and print the r_squared value of the prediction on the test data.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

model = LinearRegression()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)

print(f"R-squared value: {r2:.4f}")
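To show what `r2_score` measures, the sketch below computes R-squared from its definition, one minus the ratio of the residual sum of squares to the total sum of squares, and compares it with scikit-learn's value. As a simplifying assumption it fits `tip` on `total_bill` alone and evaluates on the same ten sample rows, so the number is an in-sample illustration and not the test-set score asked for above. On held-out data R-squared can be lower, and even negative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Ten sample rows from this lab: one feature (total_bill) and the target (tip).
X = np.array([21.5, 17.7, 30.0, 23.1, 26.0, 14.1, 16.9, 29.0, 10.3, 25.3]).reshape(-1, 1)
y = np.array([3.50, 3.00, 5.00, 2.00, 4.00, 1.60, 2.70, 5.00, 1.00, 4.70])

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y - y_pred) ** 2)    # variation left unexplained by the model
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(f"manual R^2 = {r2_manual:.4f}, r2_score = {r2_score(y, y_pred):.4f}")
```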

### Happy Learning:)