<a href="https://colab.research.google.com/github/Axlbenja/axel.paredes/blob/main/Ramen_%26_Avocado_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install scikit-learn imbalanced-learn

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek
from sklearn.impute import SimpleImputer



#1. Ramen Dataset

In [3]:
#Task 1: Read the data and display the first five rows
df = pd.read_csv("https://raw.githubusercontent.com/fenago/dw/refs/heads/main/ramen-ratings.csv")
print("First 5 rows of the original DataFrame:")
print(df.head())

First 5 rows of the original DataFrame:
            Brand                                            Variety Style  \
0       New Touch                          T's Restaurant Tantanmen    Cup   
1        Just Way  Noodles Spicy Hot Sesame Spicy Hot Sesame Guan...  Pack   
2          Nissin                      Cup Noodles Chicken Vegetable   Cup   
3         Wei Lih                      GGE Ramen Snack Tomato Flavor  Pack   
4  Ching's Secret                                    Singapore Curry  Pack   

  Country  Stars  
0   Japan   3.75  
1  Taiwan   1.00  
2     USA   2.25  
3  Taiwan   2.75  
4   India   3.75  


In [5]:
#Task 2: Add a column with the average rating per country
df['AvgRatingByCountry'] = df.groupby('Country')['Stars'].transform('mean')
print("\nFirst 5 rows after adding AvgRatingByCountry column:")
print(df.head())


First 5 rows after adding AvgRatingByCountry column:
            Brand                                            Variety Style  \
0       New Touch                          T's Restaurant Tantanmen    Cup   
1        Just Way  Noodles Spicy Hot Sesame Spicy Hot Sesame Guan...  Pack   
2          Nissin                      Cup Noodles Chicken Vegetable   Cup   
3         Wei Lih                      GGE Ramen Snack Tomato Flavor  Pack   
4  Ching's Secret                                    Singapore Curry  Pack   

  Country  Stars  AvgRatingByCountry  
0   Japan   3.75            3.981605  
1  Taiwan   1.00            3.665402  
2     USA   2.25            3.457043  
3  Taiwan   2.75            3.665402  
4   India   3.75            3.395161  
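The difference between `transform('mean')` and a plain grouped `mean()` is easiest to see on a tiny frame. The following is a minimal sketch; the `sample` frame and its values are made up for illustration and are not taken from the ramen dataset.

```python
import pandas as pd

# Illustrative sample only; these ratings are invented for the example.
sample = pd.DataFrame({
    "Country": ["Japan", "Japan", "Taiwan", "Taiwan", "USA"],
    "Stars": [4.0, 3.0, 1.0, 3.0, 2.5],
})

# A grouped mean collapses the data to one value per country.
per_country = sample.groupby("Country")["Stars"].mean()

# transform keeps the original row count, repeating each group's mean
# on every row of that group, so it can be assigned directly as a column.
sample["AvgRatingByCountry"] = sample.groupby("Country")["Stars"].transform("mean")

print(per_country)
print(sample)
```

Because the transformed result is aligned with the original index, no merge back onto the frame is needed.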


In [6]:
#Task 3: Group by Country and Style, then take the mean and unstack the Style column
country_style_avg = df.groupby(['Country', 'Style'])['Stars'].mean().unstack()
print("\nGrouped data by Country and Style, unstacked for Stars:")
print(country_style_avg)


Grouped data by Country and Style, unstacked for Stars:
Style          Bar      Bowl   Box  Can       Cup      Pack      Tray
Country                                                              
Australia      NaN       NaN   NaN  NaN  3.120588  3.200000       NaN
Bangladesh     NaN       NaN   NaN  NaN       NaN  3.714286       NaN
Brazil         NaN       NaN   NaN  NaN  4.500000  4.250000       NaN
Cambodia       NaN       NaN   NaN  NaN       NaN  4.200000       NaN
Canada         NaN  2.281250   NaN  NaN  1.970588  2.515625       NaN
China          NaN  3.527778   NaN  NaN  2.859375  3.538776  2.583333
Colombia       NaN       NaN   NaN  NaN  3.083333  3.500000       NaN
Dubai          NaN       NaN   NaN  NaN       NaN  3.583333       NaN
Estonia        NaN       NaN   NaN  NaN       NaN  3.500000       NaN
Fiji           NaN       NaN   NaN  NaN       NaN  3.875000       NaN
Finland        NaN       NaN   NaN  NaN       NaN  3.583333       NaN
Germany        NaN       NaN   Na
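The same Country-by-Style table can probably also be produced with `pivot_table`, which some readers find easier to scan than `groupby(...).unstack()`. This is a sketch on an invented `sample` frame, not on the ramen data; the `NaN` cells mark Country/Style combinations that have no rows.

```python
import pandas as pd

# Illustrative sample only; values are invented for the example.
sample = pd.DataFrame({
    "Country": ["Japan", "Japan", "Japan", "USA", "USA"],
    "Style": ["Cup", "Cup", "Pack", "Pack", "Bowl"],
    "Stars": [4.0, 3.0, 5.0, 2.0, 3.5],
})

# Approach used in the cell above: group, average, then move Style to the columns.
unstacked = sample.groupby(["Country", "Style"])["Stars"].mean().unstack()

# Alternative: pivot_table does the grouping and reshaping in one call.
pivoted = sample.pivot_table(index="Country", columns="Style",
                             values="Stars", aggfunc="mean")

print(unstacked)
print(pivoted)
```

Either form can be followed by `.fillna(...)` if the missing combinations should be shown as something other than `NaN`.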

In [7]:
#Task 4: Function to shorten Variety names
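def shorten_variety(variety):
    """Return the first two words of a variety name followed by '...'."""
    # The ellipsis is appended even when the name has two words or fewer.
    words = variety.split()[:2] + ['...']
    return ' '.join(words)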
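As written, `shorten_variety` adds the ellipsis to every name, including names that were not actually shortened, and it would raise an error on a non-string value such as `NaN`. One possible variant is sketched below; the name `shorten_variety_safe` and the `max_words` parameter are hypothetical and not part of the original notebook.

```python
def shorten_variety_safe(variety, max_words=2):
    """Hypothetical variant: add '...' only when words were actually dropped."""
    if not isinstance(variety, str):
        return variety  # leave NaN and other non-text values untouched
    words = variety.split()
    if len(words) <= max_words:
        return " ".join(words)
    return " ".join(words[:max_words]) + " ..."


print(shorten_variety_safe("Cup Noodles Chicken Vegetable"))  # Cup Noodles ...
print(shorten_variety_safe("Singapore Curry"))                # Singapore Curry
```

Whether the unconditional ellipsis is a problem depends on how the shortened names are used; for display purposes the original behavior may be perfectly acceptable.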

In [8]:
#Task 5: Apply function to create ShortVariety column
df['ShortVariety'] = df['Variety'].apply(shorten_variety)
print("\nFirst 5 rows after adding ShortVariety column:")
print(df.head())


First 5 rows after adding ShortVariety column:
            Brand                                            Variety Style  \
0       New Touch                          T's Restaurant Tantanmen    Cup   
1        Just Way  Noodles Spicy Hot Sesame Spicy Hot Sesame Guan...  Pack   
2          Nissin                      Cup Noodles Chicken Vegetable   Cup   
3         Wei Lih                      GGE Ramen Snack Tomato Flavor  Pack   
4  Ching's Secret                                    Singapore Curry  Pack   

  Country  Stars  AvgRatingByCountry         ShortVariety  
0   Japan   3.75            3.981605   T's Restaurant ...  
1  Taiwan   1.00            3.665402    Noodles Spicy ...  
2     USA   2.25            3.457043      Cup Noodles ...  
3  Taiwan   2.75            3.665402        GGE Ramen ...  
4   India   3.75            3.395161  Singapore Curry ...  


In [9]:
#Task 6: Set index to Brand and display first five rows
df.set_index('Brand', inplace=True)
print("\nFirst 5 rows after setting Brand as index:")
print(df.head())


First 5 rows after setting Brand as index:
                                                          Variety Style  \
Brand                                                                     
New Touch                               T's Restaurant Tantanmen    Cup   
Just Way        Noodles Spicy Hot Sesame Spicy Hot Sesame Guan...  Pack   
Nissin                              Cup Noodles Chicken Vegetable   Cup   
Wei Lih                             GGE Ramen Snack Tomato Flavor  Pack   
Ching's Secret                                    Singapore Curry  Pack   

               Country  Stars  AvgRatingByCountry         ShortVariety  
Brand                                                                   
New Touch        Japan   3.75            3.981605   T's Restaurant ...  
Just Way        Taiwan   1.00            3.665402    Noodles Spicy ...  
Nissin             USA   2.25            3.457043      Cup Noodles ...  
Wei Lih         Taiwan   2.75            3.665402        GGE Rame

In [10]:
#Task 7: Reset index to default
df.reset_index(inplace=True)
print("\nFirst 5 rows after resetting the index:")
print(df.head())


First 5 rows after resetting the index:
            Brand                                            Variety Style  \
0       New Touch                          T's Restaurant Tantanmen    Cup   
1        Just Way  Noodles Spicy Hot Sesame Spicy Hot Sesame Guan...  Pack   
2          Nissin                      Cup Noodles Chicken Vegetable   Cup   
3         Wei Lih                      GGE Ramen Snack Tomato Flavor  Pack   
4  Ching's Secret                                    Singapore Curry  Pack   

  Country  Stars  AvgRatingByCountry         ShortVariety  
0   Japan   3.75            3.981605   T's Restaurant ...  
1  Taiwan   1.00            3.665402    Noodles Spicy ...  
2     USA   2.25            3.457043      Cup Noodles ...  
3  Taiwan   2.75            3.665402        GGE Ramen ...  
4   India   3.75            3.395161  Singapore Curry ...  
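Tasks 6 and 7 modify `df` in place. A non-mutating version of the same round trip is sketched below on an invented `sample` frame; it also shows one reason to set an index in the first place, namely label-based lookup with `.loc`.

```python
import pandas as pd

# Illustrative sample only; values are invented for the example.
sample = pd.DataFrame({
    "Brand": ["Nissin", "Wei Lih", "Nissin"],
    "Stars": [2.25, 2.75, 4.0],
})

indexed = sample.set_index("Brand")    # returns a new frame; `sample` is unchanged
nissin_rows = indexed.loc[["Nissin"]]  # label lookup; duplicate index labels are allowed
restored = indexed.reset_index()       # Brand becomes an ordinary column again

print(nissin_rows)
print(restored.equals(sample))
```

Without `inplace=True`, each step returns a new DataFrame, which makes it easier to rerun a single cell without changing the state of `df`.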


#2. Avocado Dataset

In [21]:
#1. Read the data from the .csv file into a DataFrame and display the first five rows.
url = 'https://raw.githubusercontent.com/fenago/dw/refs/heads/main/avocado.csv'
df = pd.read_csv(url)
print("First 5 rows of the original DataFrame:")
print(df.head())

First 5 rows of the original DataFrame:
   Unnamed: 0        Date  AveragePrice  Total Volume     4046       4225  \
0           0  2015-12-27          1.33      64236.62  1036.74   54454.85   
1           1  2015-12-20          1.35      54876.98   674.28   44638.81   
2           2  2015-12-13          0.93     118220.22   794.70  109149.67   
3           3  2015-12-06          1.08      78992.15  1132.00   71976.41   
4           4  2015-11-29          1.28      51039.60   941.48   43838.39   

     4770  Total Bags  Small Bags  Large Bags  XLarge Bags          type  \
0   48.16     8696.87     8603.62       93.25          0.0  conventional   
1   58.33     9505.56     9408.07       97.49          0.0  conventional   
2  130.50     8145.35     8042.21      103.14          0.0  conventional   
3   72.58     5811.16     5677.40      133.76          0.0  conventional   
4   75.78     6183.95     5986.26      197.69          0.0  conventional   

   year  region  
0  2015  Albany  
1  2

In [23]:
#2. Remove spaces from the column names, then apply title case (e.g. 'Total Bags' becomes 'Totalbags').
df.columns = [col.replace(' ', '').title() for col in df.columns]
print("\nColumn names after standardization:")
print(df.columns)


Column names after standardization:
Index(['Unnamed:0', 'Date', 'Averageprice', 'Totalvolume', '4046', '4225',
       '4770', 'Totalbags', 'Smallbags', 'Largebags', 'Xlargebags', 'Type',
       'Year', 'Region'],
      dtype='object')
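The order of `replace` and `title` matters here. Removing the spaces first leaves `str.title()` with no word boundaries, which is why the result is `Totalbags` rather than `TotalBags`. The sketch below compares both orders on a subset of the original column names; it is offered as an illustration, not as the intended solution.

```python
cols = ["Unnamed: 0", "AveragePrice", "Total Volume", "Total Bags", "XLarge Bags", "type"]

# Order used in the cell above: remove spaces, then title-case.
strip_then_title = [c.replace(" ", "").title() for c in cols]

# Alternative order: title-case while spaces still mark word boundaries, then remove them.
title_then_strip = [c.title().replace(" ", "") for c in cols]

print(strip_then_title)
# ['Unnamed:0', 'Averageprice', 'Totalvolume', 'Totalbags', 'Xlargebags', 'Type']
print(title_then_strip)
# ['Unnamed:0', 'Averageprice', 'TotalVolume', 'TotalBags', 'XlargeBags', 'Type']
```

Note that `str.title()` lowercases every letter that follows another letter, so `AveragePrice` becomes `Averageprice` in both orders. The later cells in this notebook index columns as `Totalbags`, `Smallbags`, and so on, so switching the order would also require updating those names.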


In [27]:
#3. Add a column with extra-large bags as a percentage of the Totalbags column.
df['ExtraLargeBagPercentage'] = (df['Xlargebags'] / df['Totalbags']) * 100

In [30]:
#4. Add a column with large bags as a percentage of the Totalbags column.
df['LargeBagPercentage'] = (df['Largebags'] / df['Totalbags']) * 100

In [32]:
#5. Add a column with small bags as a percentage of the Totalbags column.
df['SmallBagPercentage'] = (df['Smallbags'] / df['Totalbags']) * 100
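If any row has a `Totalbags` value of zero, the divisions above give `NaN` (for 0/0) or `inf` (for a nonzero part divided by zero). A minimal sketch of one way to guard against that, and to compute the three columns in a loop, is shown below. The `sample` frame is invented for illustration, and whether the avocado data actually contains zero totals is not checked here.

```python
import numpy as np
import pandas as pd

# Illustrative sample only; the second row has a zero total on purpose.
sample = pd.DataFrame({
    "Totalbags": [100.0, 0.0, 50.0],
    "Smallbags": [80.0, 0.0, 25.0],
    "Largebags": [15.0, 0.0, 25.0],
    "Xlargebags": [5.0, 0.0, 0.0],
})

# Replace zero totals with NaN so every division by zero yields NaN, never inf.
total = sample["Totalbags"].replace(0, np.nan)

for source, target in [
    ("Xlargebags", "ExtraLargeBagPercentage"),
    ("Largebags", "LargeBagPercentage"),
    ("Smallbags", "SmallBagPercentage"),
]:
    sample[target] = sample[source] / total * 100

print(sample)
```

Rows with a missing percentage can then be filtered or filled explicitly, rather than carrying `inf` values into later calculations.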

In [33]:
#6. Display the first five rows of data to view the new columns.
print("\nFirst 5 rows after adding new percentage columns:")
print(df.head())


First 5 rows after adding new percentage columns:
   Unnamed:0        Date  Averageprice  Totalvolume     4046       4225  \
0          0  2015-12-27          1.33     64236.62  1036.74   54454.85   
1          1  2015-12-20          1.35     54876.98   674.28   44638.81   
2          2  2015-12-13          0.93    118220.22   794.70  109149.67   
3          3  2015-12-06          1.08     78992.15  1132.00   71976.41   
4          4  2015-11-29          1.28     51039.60   941.48   43838.39   

     4770  Totalbags  Smallbags  Largebags  Xlargebags          Type  Year  \
0   48.16    8696.87    8603.62      93.25         0.0  conventional  2015   
1   58.33    9505.56    9408.07      97.49         0.0  conventional  2015   
2  130.50    8145.35    8042.21     103.14         0.0  conventional  2015   
3   72.58    5811.16    5677.40     133.76         0.0  conventional  2015   
4   75.78    6183.95    5986.26     197.69         0.0  conventional  2015   

   Region  ExtraLargeBagPerce

In [35]:
#7. Assign the Region, Type, Year, and Totalbags columns to a new DataFrame.
new_df = df[['Region', 'Type', 'Year', 'Totalbags']].copy()
print("\nNew DataFrame with selected columns:")
print(new_df.head())


New DataFrame with selected columns:
   Region          Type  Year  Totalbags
0  Albany  conventional  2015    8696.87
1  Albany  conventional  2015    9505.56
2  Albany  conventional  2015    8145.35
3  Albany  conventional  2015    5811.16
4  Albany  conventional  2015    6183.95
