# Machine learning project
## Predicting the category of a product
**Author**: Sanjin Jareb

### Step 1:
Load the dataset that we will work with.

In [19]:
import pandas as pd

df = pd.read_csv('products.csv')

# Display the first 5 rows of the DataFrame
print(df.head())

   product ID                                      Product Title  Merchant ID  \
0           1                    apple iphone 8 plus 64gb silver            1   
1           2                apple iphone 8 plus 64 gb spacegrau            2   
2           3  apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...            3   
3           4                apple iphone 8 plus 64gb space grey            4   
4           5  apple iphone 8 plus gold 5.5 64gb 4g unlocked ...            5   

   Category Label _Product Code  Number_of_Views  Merchant Rating  \
0   Mobile Phones    QA-2276-XC            860.0              2.5   
1   Mobile Phones    KA-2501-QO           3772.0              4.8   
2   Mobile Phones    FP-8086-IE           3092.0              3.9   
3   Mobile Phones    YI-0086-US            466.0              3.4   
4   Mobile Phones    NZ-3586-WP           4426.0              1.6   

   Listing Date    
0       5/10/2024  
1      12/31/2024  
2      11/10/2024  
3        5/2/2022 

### Step 2:
Inspect the values in the DataFrame and standardize them.

In [20]:
# Display the data types of each column
print(df.dtypes)
# Check for missing values in each column
print(df.isnull().sum())

product ID           int64
Product Title       object
Merchant ID          int64
 Category Label     object
_Product Code       object
Number_of_Views    float64
Merchant Rating    float64
 Listing Date       object
dtype: object
product ID           0
Product Title      172
Merchant ID          0
 Category Label     44
_Product Code       95
Number_of_Views     14
Merchant Rating    170
 Listing Date       59
dtype: int64


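As the output shows, several columns that matter for this project contain missing values. We will therefore delete every row that has a missing value in any of the following columns: **"Product Title", "Category Label", "_Product Code"**. The columns "Number_of_Views", "Merchant Rating" and "Listing Date" are treated as unnecessary and are dropped entirely.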
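
Before dropping anything, it can help to see how large the gaps are relative to the dataset. The sketch below is only an illustration: `toy` is a hypothetical stand-in for the real data (it does not read `products.csv`), and its rows are made up apart from the column names. It shows the share of missing values per column and how many rows have at least one gap.

```python
import pandas as pd

# Hypothetical toy frame that mimics the raw column names (not the real products.csv)
toy = pd.DataFrame({
    "Product Title": ["apple iphone 8 plus 64gb silver", None, "example title a", "example title b"],
    " Category Label": ["Mobile Phones", "Mobile Phones", None, "Mobile Phones"],
    "_Product Code": ["QA-2276-XC", "KA-2501-QO", "FP-8086-IE", None],
})

# Share of missing values per column, expressed as a percentage
missing_pct = toy.isnull().mean().mul(100).round(1)

# Number of rows that contain at least one missing value
rows_with_gap = int(toy.isnull().any(axis=1).sum())

print(missing_pct)
print(f"Rows with at least one missing value: {rows_with_gap} of {len(toy)}")
```

Running the same two expressions on `df` would give the corresponding figures for the real dataset.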

In [23]:
# Standardize column names by stripping leading/trailing spaces and any leading underscore
df.columns = df.columns.str.strip().str.lstrip('_')

# Delete unnecessary columns
columns_to_delete = ['Number_of_Views', 'Merchant Rating', 'Listing Date']
df.drop(columns=columns_to_delete, inplace=True)

# Deleting rows with missing values in important columns
important_columns = ['Product Title', 'Category Label', 'Product Code']
df.dropna(subset=important_columns, inplace=True)

# Resetting the index after row deletions
df.reset_index(drop=True, inplace=True)
# Making sure there are no missing values
print(df.isnull().sum())

product ID        0
Product Title     0
Merchant ID       0
Category Label    0
Product Code      0
dtype: int64
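
To make the effect of the cleaning step easier to follow, here is a minimal sketch on a small frame. The frame `raw` and its rows are hypothetical and only borrow the column names from the dataset. Unlike the cell above, it uses `rename` and non-inplace calls, which is an alternative style rather than the approach used in this notebook.

```python
import pandas as pd

# Hypothetical miniature of the raw data, with the same untidy column names
raw = pd.DataFrame({
    "product ID": [1, 2, 3],
    " Category Label": ["Mobile Phones", None, "Mobile Phones"],
    "_Product Code": ["QA-2276-XC", "KA-2501-QO", None],
})

# strip() removes surrounding whitespace; lstrip("_") then removes leading underscores
cleaned = raw.rename(columns=lambda name: name.strip().lstrip("_"))

# Keep only rows where every listed column has a value, then renumber the index from 0
cleaned = cleaned.dropna(subset=["Category Label", "Product Code"]).reset_index(drop=True)

print(list(cleaned.columns))
print(cleaned)
```

Because nothing is modified in place, `raw` stays untouched and can be compared with `cleaned` afterwards.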


The DataFrame now contains only the columns that are relevant to the project. We display the first 10 rows to get a clear view of the new DataFrame.

In [24]:
print(df.head(10))

   product ID                                      Product Title  Merchant ID  \
0           1                    apple iphone 8 plus 64gb silver            1   
1           2                apple iphone 8 plus 64 gb spacegrau            2   
2           3  apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...            3   
3           4                apple iphone 8 plus 64gb space grey            4   
4           5  apple iphone 8 plus gold 5.5 64gb 4g unlocked ...            5   
5           6  apple iphone 8 plus gold 5.5 64gb 4g unlocked ...            6   
6           7               apple iphone 8 plus 64 gb space grey            7   
7           8                apple iphone 8 plus 64gb space grey            8   
8           9                apple iphone 8 plus 64gb space grey            9   
9          10                apple iphone 8 plus 64gb space grey           10   

  Category Label Product Code  
0  Mobile Phones   QA-2276-XC  
1  Mobile Phones   KA-2501-QO  
2  Mobile Ph