# Collectible Car Inventory Management

In this notebook, I delve into a raw dataset containing a comprehensive list of collectible cars from my father's extensive collection. The aim is to transparently document the journey of transforming this initial dataset into a cleaned and well-structured format. This transformation is crucial as it lays the groundwork for developing a smart inventory system. With this system, my father will be able to manage his collection more efficiently and effectively.

# Project Outline

1. **Data Exploration:** Understanding the composition and structure of the raw data.
2. **Data Restructuring:** Reformatting the dataset into a more usable format that aligns with the needs of an inventory management system.
3. **Data Cleaning:** Addressing inconsistencies, missing values, and any inaccuracies to refine the dataset.

By the end of this notebook, the goal is to have a dataset that not only provides a clear view of the collection but is also optimized for integration into an inventory management application. This will enable easy updates and management, supporting both current enjoyment and future legacy planning of the collection.

Join me as I transform raw data into a powerful tool for collection management!


# 1. Data Exploration

## 1.1 Initial Data Loading and Inspection

Let's start by loading the dataset from the Excel file provided by my client (dad) to get an initial idea of what it looks like.

In [None]:
import pandas as pd

# Filepath
file_path = '/kaggle/input/cars-catalogue-main-raw/Cars catalogue Main_RAW.xlsx'

# Load the data from the Excel file
data = pd.read_excel(file_path)

# Display the structure of the DataFrame
print("Data Info:")
data.info()

# Display the first few rows to understand the data better
print("First few rows of the dataset:")
display(data.head())


### 1.1.1 Brief overview based on the first 5 rows:

* The dataset from the "Cars catalogue Main_RAW.xlsx" file has a total of **410 entries** across **181 columns**.
* Each column represents a different collection or series of car models.
  * The dataset includes a variety of car collections, such as "Altaya," "Opel Collection journal," "Opel Collection Designer-Serie," "Deutsche Liebhaber-autos," "Auto Plus - Les classiques de l'automobile," "Mercedes-Benz journal series (Ixo)," and others.
* The entries in these columns describe specific car models, sometimes with details such as model year, color, or edition, without any consistent formatting. Examples include "Matra Simca Rancho 1978", "Opel 10/40 PS Modell 80 1925-1929" and "Mercedes-Benz 280 SL 1963 Roadster".

### 1.1.2 Observations from the Data

* **Sparsity:** Many columns have missing values, indicating incomplete records.
* **Inconsistency:** Descriptions vary widely, with mixed information like model, year, and color.
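The sparsity observation can be put into numbers. The sketch below uses a small made-up frame (`toy`, with invented column names and entries) in place of the real catalogue; the same two calls should work on `data` as loaded above.

```python
import pandas as pd

# Toy stand-in for the wide catalogue: one column per collection (made-up values)
toy = pd.DataFrame({
    "Collection A": ["Opel Kadett 1965", "Fiat 500 1957", None],
    "Collection B": ["Ford Capri 1969", None, None],
})

# Number of filled cells per column
filled_per_column = toy.count()

# Share of empty cells across the whole table
empty_share = toy.isna().to_numpy().mean()

print(filled_per_column)
print(f"Share of empty cells: {empty_share:.0%}")
```

Columns with very few filled cells are the short collections; a high overall share of empty cells is expected when every collection gets its own column.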

## The next steps would include:

* **Data Structuring:** Restructuring the data to a more analysis-friendly format, potentially consolidating similar columns or reorganizing the dataset based on specific attributes like model year, make, or collection.
* **Data Cleaning:** Handling missing data, standardizing text entries, and possibly reformatting data for better usability.
* **Analysis/Visualization Preparation:** Identifying key variables for analysis and preparing the dataset accordingly for importing into Looker Studio for visualization.

# 2. Data Restructuring

Let's proceed with restructuring the dataset into two columns: one for the item manufacturer's name and another for the full item name. This will make the dataset simpler and more suitable for further analysis, visualizations and overall use. We'll create a new DataFrame with this format, where each row will represent a collectible with its manufacturer and full name. Let's do this transformation.
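To see what this reshaping does before running it on the full catalogue, here is a minimal sketch on a made-up frame (`toy` and its values are invented for illustration). It also includes a sanity check: the number of rows in the long table should equal the number of filled cells in the wide table.

```python
import pandas as pd

# Toy wide table: one column per collection (made-up values)
toy = pd.DataFrame({
    "Collection A": ["Opel Kadett 1965", "Fiat 500 1957", None],
    "Collection B": ["Ford Capri 1969", None, None],
})

# Wide -> long: the column name becomes a value, each cell becomes a row
long_form = (
    toy.melt(var_name="Model_Manufacturer_name",
             value_name="Collectible_item_full_name")
       .dropna(subset=["Collectible_item_full_name"])
       .reset_index(drop=True)
)

# Sanity check: no item was lost or duplicated by the reshaping
assert len(long_form) == int(toy.count().sum())

print(long_form)
```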

In [None]:
import pandas as pd

# Filepath
file_path = '/kaggle/input/cars-catalogue-main-raw/Cars catalogue Main_RAW.xlsx'

# Load the dataset
data = pd.read_excel(file_path)

# Melt the DataFrame to reformat it into two columns: 'Model_Manufacturer_name' and 'Collectible_item_full_name'
melted_data = data.melt(var_name='Model_Manufacturer_name', value_name='Collectible_item_full_name')

# Remove rows where 'Collectible_item_full_name' is null as they do not provide useful information
cleaned_data = melted_data.dropna(subset=['Collectible_item_full_name'])

# Show summary information and the first few rows of the newly structured DataFrame
print("Summary Information of Cleaned Data:")
cleaned_data.info()  # info() prints directly and returns None, so there is nothing to store

print("\nFirst Few Rows of Cleaned Data:")
display(cleaned_data.head())

# If you need to save or export the cleaned data:
cleaned_data.to_csv('/kaggle/working/cleaned_data_v1_2columns.csv', index=False)


The dataset has been successfully transformed into two columns: **'Model_Manufacturer_name'** and **'Collectible_item_full_name'**. Now, we have **4,045 entries**, each representing a specific collectible item, where the 'Model_Manufacturer_name' column identifies the collectible manufacturer or publishing series, and 'Collectible_item_full_name' includes detailed descriptions like *brand name*, *model*, and sometimes *year* and *color*.

## 2.1 Extracting Relevant Information

Now we need to extract the brand, the model and, if possible, the year of each item. The cleaned data will be structured as follows:

* **Model_Manufacturer_name:** The manufacturer or series name.
* **Collectible_item_full_name:** The full name of the collectible item, as originally listed.
* **Brand:** The extracted brand name of the collectible item.
* **Model:** The extracted model description of the collectible item.
* **Year:** The extracted year or year range of the collectible item.
* **Color:** The extracted color of the collectible item.

In [None]:
import pandas as pd
import re
from IPython.display import display

# Read the dataset
data = pd.read_excel('/kaggle/input/cars-catalogue-main-raw/Cars catalogue Main_RAW.xlsx')

# Function to extract brand, model, and year from the collectible item full name
def extract_info(text):
    pattern = r'^(?P<Brand>[\w\s-]+?)\s(?P<Model>.*?)\s(?P<Year>\d{4}(?:-\d{4})?)'
    match = re.match(pattern, str(text))
    if match:
        return match.group('Brand').strip(), match.group('Model').strip(), match.group('Year')
    else:
        return None, None, None

# Initialize an empty list to store the reshaped data
reshaped_rows = []

# Iterate through each column of the original DataFrame
for col in data.columns:
    # Extract manufacturer name from the column name
    manufacturer_name = col.split('_')[0]
    # Iterate through each item in the column
    for item in data[col]:
        # Check if the item is not empty
        if pd.notna(item):
            brand, model, year = extract_info(item)
            # Append a new row to the reshaped data list
            reshaped_rows.append({'Model_Manufacturer_name': manufacturer_name,
                                  'Collectible_item_full_name': item,
                                  'Brand': brand, 'Model': model, 'Year': year})

# Convert the list of dictionaries to a DataFrame
reshaped_data = pd.DataFrame(reshaped_rows)

# Filter for valid entries where 'Model_Manufacturer_name' is not missing
valid_data = reshaped_data[reshaped_data['Model_Manufacturer_name'].notna()]

# Count missing 'Brand' values in the valid data
missing_brands_count = valid_data[valid_data['Brand'].isnull()].shape[0]
print(f"Number of entries with missing 'Brand', excluding missing 'Model_Manufacturer_name': {missing_brands_count}")

# Count missing 'Year' values in the valid data
missing_years_count = valid_data[valid_data['Year'].isnull()].shape[0]
print(f"Number of entries with missing 'Year', excluding missing 'Model_Manufacturer_name': {missing_years_count}")

# If you need to save or export the reshaped data:
reshaped_data.to_csv('/kaggle/working/cleaned_data_v2_multiCol_1675-missing-brand.csv', index=False)

# Display the reshaped data
display(reshaped_data)

## 2.2 Filtering missing values

So far so good. A closer look, however, shows that although the regular expression in the script above parsed many entries, 1675 entries still have no value in the 'Brand', 'Model', and 'Year' columns. The item names are formatted in more ways than the pattern covers. The next step is to filter out the entries with missing values and look for further formatting variations that the regular expression should handle.
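The behaviour of the pattern is easier to judge on individual names. The sketch below reuses the pattern from the cell above on the three sample names quoted in section 1.1.1, plus one name without a year ("Opel Kadett", a made-up entry used only for illustration).

```python
import re

# Same pattern as in extract_info above
PATTERN = r'^(?P<Brand>[\w\s-]+?)\s(?P<Model>.*?)\s(?P<Year>\d{4}(?:-\d{4})?)'

def extract_info(text):
    match = re.match(PATTERN, str(text))
    if match:
        return match.group('Brand').strip(), match.group('Model').strip(), match.group('Year')
    return None, None, None

samples = [
    "Matra Simca Rancho 1978",
    "Opel 10/40 PS Modell 80 1925-1929",
    "Mercedes-Benz 280 SL 1963 Roadster",
    "Opel Kadett",  # made-up entry without a year
]

results = {name: extract_info(name) for name in samples}
for name, parsed in results.items():
    print(f"{name!r} -> {parsed}")
```

Two things stand out. The brand is always the first word, so a two-word brand such as "Matra Simca" is split into brand "Matra" and model "Simca Rancho". And because the pattern requires a four-digit year, any name without one yields no brand or model at all, which is one likely source of the missing values.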

### Filtered data:

In [None]:
import pandas as pd
import re
from IPython.display import display

# Read the dataset
data = pd.read_excel('/kaggle/input/cars-catalogue-main-raw/Cars catalogue Main_RAW.xlsx')

# Function to extract brand, model, and year from the collectible item full name
def extract_info(text):
    pattern = r'^(?P<Brand>[\w\s-]+?)\s(?P<Model>.*?)\s(?P<Year>\d{4}(?:-\d{4})?)'
    match = re.match(pattern, str(text))
    if match:
        return match.group('Brand').strip(), match.group('Model').strip(), match.group('Year')
    else:
        return None, None, None

# Initialize an empty list to store the reshaped data
reshaped_rows = []

# Iterate through each column of the original DataFrame
for col in data.columns:
    # Extract manufacturer name from the column name
    manufacturer_name = col.split('_')[0]
    # Iterate through each item in the column
    for item in data[col]:
        # Check if the item is not empty
        if pd.notna(item):
            brand, model, year = extract_info(item)
            # Append a new row to the reshaped data list
            reshaped_rows.append({'Model_Manufacturer_name': manufacturer_name,
                                  'Collectible_item_full_name': item,
                                  'Brand': brand, 'Model': model, 'Year': year})

# Convert the list of dictionaries to a DataFrame
reshaped_data = pd.DataFrame(reshaped_rows)

# Filter for valid entries where 'Model_Manufacturer_name' is not missing
valid_data = reshaped_data[reshaped_data['Model_Manufacturer_name'].notna()]

# Count missing 'Brand' values in the valid data
missing_brands_count = valid_data[valid_data['Brand'].isnull()].shape[0]
print(f"Number of entries with missing 'Brand', excluding missing 'Model_Manufacturer_name': {missing_brands_count}")

# Count missing 'Year' values in the valid data
missing_years_count = valid_data[valid_data['Year'].isnull()].shape[0]
print(f"Number of entries with missing 'Year', excluding missing 'Model_Manufacturer_name': {missing_years_count}")

# Filter entries without a brand
entries_without_brand = valid_data[valid_data['Brand'].isnull()]

# Display entries without a brand
print("Entries without a brand:")
display(entries_without_brand)

# If you need to save or export the entries without a brand:
entries_without_brand.to_csv('/kaggle/working/cleaned_data_v2_list_missing-brand-only.csv', index=False)



Now all entries with missing 'Brand', 'Model' and 'Year' values are neatly listed. Sampling many of those 1675 entries or fixing them by hand would be slow and inefficient, so to save time we export the list to a file and let ChatGPT4 analyze it for further variations that the regular expression could capture, reducing the missing values to a minimum.
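A rough local alternative for spotting format variations is to reduce every unmatched name to its "shape" and count the shapes. This is not what was done in this notebook; it is only a sketch, and the helper `shape` and the sample names are made up for illustration.

```python
import re
from collections import Counter

def shape(name):
    """Collapse runs of letters to 'A' and runs of digits to '9'."""
    s = re.sub(r'[^\W\d_]+', 'A', str(name))  # any run of (Unicode) letters
    s = re.sub(r'\d+', '9', s)                # any run of digits
    return s

# Made-up names standing in for the entries without a brand
unmatched = ["Opel Kadett", "Ford Capri", "Fiat 500 (red)", "Saab 96 (blue)"]

counts = Counter(shape(name) for name in unmatched)
for pattern_shape, n in counts.most_common():
    print(f"{n:>4}  {pattern_shape}")
```

The most frequent shapes point to the variations worth handling first, for example names without a year or names with a trailing note in parentheses.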

## 2.3 Reducing the missing values

After the analysis, the newly identified variations are included in the code.

In [None]:
import pandas as pd
import re
from IPython.display import display

# Read the dataset
data = pd.read_excel('/kaggle/input/cars-catalogue-main-raw/Cars catalogue Main_RAW.xlsx')

# Function to extract brand, model, and year, with enhanced regex
def extract_info_improved(text):
    # Normalize the text by removing known noise patterns and handling edge cases
    text = str(text)
    text = re.sub(r'\s+\(.*?\)', '', text)  # Remove any content inside parentheses
    text = re.sub(r'\s-\s.*', '', text)     # Remove descriptions after a dash
    text = re.sub(r"\b(?<!\d)(?!\d{4})\d+\b", "", text)  # Remove isolated numbers that are not part of a four-digit year
    text = text.replace(',', '')  # Remove commas that might be used as separators

    # Enhanced regex pattern to handle various cases
    pattern = (
        r'^(?P<Brand>\D+?)'             # Capture the brand as non-digit characters at the start
        r'\s+(?P<Model>.*?)'            # Capture the model which might include numbers
        r'(\s+(?P<Year>\d{4}))?$'       # Optionally capture a four-digit year at the end
    )
    match = re.match(pattern, text)
    if match:
        brand = match.group('Brand').strip()
        model = match.group('Model').strip()
        year = match.group('Year') if match.group('Year') else None
        return brand, model, year
    return None, None, None

# Process each item, reshape data
reshaped_rows = []
for col in data.columns:
    manufacturer_name = col.split('_')[0]
    for item in data[col]:
        if pd.notna(item):
            brand, model, year = extract_info_improved(item)
            reshaped_rows.append({
                'Model_Manufacturer_name': manufacturer_name,
                'Collectible_item_full_name': item,
                'Brand': brand, 'Model': model, 'Year': year
            })

# Convert reshaped_rows into a DataFrame
reshaped_data = pd.DataFrame(reshaped_rows)

# Filter for valid entries where 'Model_Manufacturer_name' is not empty
valid_data = reshaped_data[reshaped_data['Model_Manufacturer_name'].notna()]

# Count and print missing 'Brand' and 'Year' values excluding missing 'Model_Manufacturer_name'
missing_brands_count = valid_data[valid_data['Brand'].isnull()].shape[0]
missing_years_count = valid_data[valid_data['Year'].isnull()].shape[0]
print(f"Number of entries with missing 'Brand', excluding missing 'Model_Manufacturer_name': {missing_brands_count}")
print(f"Number of entries with missing 'Year', excluding missing 'Model_Manufacturer_name': {missing_years_count}")

# Convert to DataFrame, export, and display
#output_filename = '/kaggle/working/cleaned_cars_catalogue_v3.xlsx'
#reshaped_data.to_excel(output_filename, index=False)
display(reshaped_data.head())


The entries with missing 'Brand' have now been reduced to only 292.
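The normalization steps at the top of `extract_info_improved` can be traced on their own. The sketch below wraps the same four substitutions in a helper named `normalize` (a name introduced here) and applies them to two made-up names.

```python
import re

def normalize(text):
    """The four clean-up steps from extract_info_improved, in the same order."""
    text = str(text)
    text = re.sub(r'\s+\(.*?\)', '', text)               # drop " (...)" notes
    text = re.sub(r'\s-\s.*', '', text)                  # drop everything after " - "
    text = re.sub(r"\b(?<!\d)(?!\d{4})\d+\b", "", text)  # drop standalone numbers shorter than four digits
    return text.replace(',', '')                         # drop commas

first = normalize("Opel Kadett (B), 1965 - limited edition")
second = normalize("Fiat 500 1957")

print(repr(first))
print(repr(second))
```

The first name is reduced to brand, model and year, which is what the pattern expects. The second shows a side effect worth keeping in mind: numeric model designations with fewer than four digits, such as "500", are removed along with the noise, so the 'Model' value for such entries should be checked against 'Collectible_item_full_name'.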

In [None]:
import pandas as pd
import re
from IPython.display import display

# Read the dataset
data = pd.read_excel('/kaggle/input/cars-catalogue-main-raw/Cars catalogue Main_RAW.xlsx')

# Common color names for regex (this list can be extended)
colors = "black|white|red|green|blue|yellow|silver|grey|orange|purple|gold|bronze|brown"

# Function to extract brand, model, year, and color, with enhanced regex
def extract_info_improved(text):
    # Normalize the text by removing known noise patterns and handling edge cases
    text = str(text)
    text = re.sub(r'\s+\(.*?\)', '', text)  # Remove any content inside parentheses
    text = re.sub(r'\s-\s.*', '', text)     # Remove descriptions after a dash
    text = re.sub(r"\b(?<!\d)(?!\d{4})\d+\b", "", text)  # Remove isolated numbers that are not part of a four-digit year
    text = text.replace(',', '')  # Remove commas that might be used as separators

    # Enhanced regex pattern to handle various cases including color
    pattern = (
        r'^(?P<Brand>\D+?)'              # Capture the brand as non-digit characters at the start
        r'\s+(?P<Model>.*?)'             # Capture the model which might include numbers
        r'(\s+(?P<Year>\d{4}))?'         # Optionally capture a four-digit year (plain raw string, so single braces; '{{4}}' would not mean "four digits" here)
        rf'(\s+(?P<Color>{colors}))?$'   # Optionally capture a color at the end
    )
    match = re.match(pattern, text)
    if match:
        brand = match.group('Brand').strip()
        model = match.group('Model').strip()
        year = match.group('Year') if match.group('Year') else None
        color = match.group('Color') if 'Color' in match.groupdict() and match.group('Color') else None
        return brand, model, year, color
    return None, None, None, None

# Process each item, reshape data
reshaped_rows = []
for col in data.columns:
    manufacturer_name = col.split('_')[0]
    for item in data[col]:
        if pd.notna(item):
            brand, model, year, color = extract_info_improved(item)
            reshaped_rows.append({
                'Model_Manufacturer_name': manufacturer_name,
                'Collectible_item_full_name': item,
                'Brand': brand,
                'Model': model,
                'Year': year,
                'Color': color
            })

# Convert reshaped_rows into a DataFrame
reshaped_data = pd.DataFrame(reshaped_rows)

# Filter for valid entries where 'Model_Manufacturer_name' is not empty
valid_data = reshaped_data[reshaped_data['Model_Manufacturer_name'].notna()]

# Count and print missing 'Brand' and 'Year' values excluding missing 'Model_Manufacturer_name'
missing_brands_count = valid_data[valid_data['Brand'].isnull()].shape[0]
missing_years_count = valid_data[valid_data['Year'].isnull()].shape[0]
print(f"Number of entries with missing 'Brand', excluding missing 'Model_Manufacturer_name': {missing_brands_count}")
print(f"Number of entries with missing 'Year', excluding missing 'Model_Manufacturer_name': {missing_years_count}")

# Display the head of the DataFrame to confirm the extraction
display(reshaped_data.head())


In [None]:
import pandas as pd
import re
from IPython.display import display

# Read the dataset
data = pd.read_excel('/kaggle/input/cars-catalogue-main-raw/Cars catalogue Main_RAW.xlsx')

# Common color names for regex (this list can be extended)
colors = "black|white|red|green|blue|yellow|silver|grey|orange|purple|gold|bronze|brown"

# Function to extract brand, model, year, and color, with enhanced regex
def extract_info_improved(text):
    # Normalize the text by removing known noise patterns and handling edge cases
    text = str(text)
    text = re.sub(r'\s+\(.*?\)', '', text)  # Remove any content inside parentheses
    text = re.sub(r'\s-\s.*', '', text)     # Remove descriptions after a dash
    text = re.sub(r"\b(?<!\d)(?!\d{4})\d+\b", "", text)  # Remove isolated numbers that are not part of a four-digit year
    text = text.replace(',', '')  # Remove commas that might be used as separators

    # Enhanced regex pattern to handle various cases including color
    pattern = (
        r'^(?P<Brand>\D+?)'              # Capture the brand as non-digit characters at the start
        r'\s+(?P<Model>.*?)'             # Capture the model which might include numbers
        r'(\s+(?P<Year>\d{4}))?'         # Optionally capture a four-digit year (plain raw string, so single braces; '{{4}}' would not mean "four digits" here)
        rf'(\s+(?P<Color>{colors}))?$'   # Optionally capture a color at the end
    )
    match = re.match(pattern, text)
    if match:
        brand = match.group('Brand').strip()
        model = match.group('Model').strip()
        year = match.group('Year') if match.group('Year') else None
        color = match.group('Color') if 'Color' in match.groupdict() and match.group('Color') else None
        return brand, model, year, color
    return None, None, None, None

# Process each item, reshape data
reshaped_rows = []
for col in data.columns:
    manufacturer_name = col.split('_')[0]
    for item in data[col]:
        if pd.notna(item):
            brand, model, year, color = extract_info_improved(item)
            reshaped_rows.append({
                'Model_Manufacturer_name': manufacturer_name,
                'Collectible_item_full_name': item,
                'Brand': brand,
                'Model': model,
                'Year': year,
                'Color': color
            })

# Convert reshaped_rows into a DataFrame
reshaped_data = pd.DataFrame(reshaped_rows)

# Filter for valid entries where 'Model_Manufacturer_name' is not empty
valid_data = reshaped_data[reshaped_data['Model_Manufacturer_name'].notna()]

# Count and print missing 'Brand', 'Year', and 'Color' values excluding missing 'Model_Manufacturer_name'
missing_brands_count = valid_data[valid_data['Brand'].isnull()].shape[0]
missing_years_count = valid_data[valid_data['Year'].isnull()].shape[0]
missing_colors_count = valid_data[valid_data['Color'].isnull()].shape[0]
print(f"Number of entries with missing 'Brand', excluding missing 'Model_Manufacturer_name': {missing_brands_count}")
print(f"Number of entries with missing 'Year', excluding missing 'Model_Manufacturer_name': {missing_years_count}")
print(f"Number of entries with missing 'Color', excluding missing 'Model_Manufacturer_name': {missing_colors_count}")

# Display the head of the DataFrame to confirm the extraction
display(reshaped_data.head())


# Adding an id to the table with years but no colors

In [None]:
import pandas as pd
import re
from IPython.display import display

# Read the dataset
data = pd.read_excel('/kaggle/input/cars-catalogue-main-raw/Cars catalogue Main_RAW.xlsx')

# Function to extract brand, model, and year with robust error handling
def extract_info_improved(text):
    text = str(text)  # Ensure text is a string
    pattern = (
        r'^(?P<Brand>\D+?)'             
        r'\s+(?P<Model>.*?)'            
        r'(\s+(?P<Year>\d{4}))?$'      
    )
    match = re.match(pattern, text)
    if match:
        brand = match.group('Brand').strip() if match.group('Brand') else None
        model = match.group('Model').strip() if match.group('Model') else None
        year = match.group('Year') if match.group('Year') else None
        return brand, model, year
    else:
        return None, None, None  # Ensure always returning three items

# Initialize counter for unique ID generation
counter = 1

# Process each item, reshape data
reshaped_rows = []
for col in data.columns:
    manufacturer_name = col.split('_')[0]
    for item in data[col]:
        if pd.notna(item):
            brand, model, year = extract_info_improved(item)
            reshaped_rows.append({
                'Model_Manufacturer_name': manufacturer_name,
                'Collectible_item_full_name': item,
                'Brand': brand, 'Model': model, 'Year': year,
                'id': f"{manufacturer_name}-{item}-{counter:04d}"
            })
            counter += 1

# Convert to DataFrame
df_a = pd.DataFrame(reshaped_rows)

# Reporting missing data and total entries
total_entries = len(df_a)
missing_brand = df_a['Brand'].isnull().sum()
missing_year = df_a['Year'].isnull().sum()

print(f"Total number of entries: {total_entries}")
print(f"Number of entries with missing 'Brand': {missing_brand}")
print(f"Number of entries with missing 'Year': {missing_year}")

# Display the DataFrame to verify results
display(df_a.head())  # Modify this to display(df_a) if you want to see the entire DataFrame


# Adding an id to the table with colors

In [None]:
import pandas as pd
import re
from IPython.display import display

# Read the dataset
data = pd.read_excel('/kaggle/input/cars-catalogue-main-raw/Cars catalogue Main_RAW.xlsx')

# Common color names for regex (this list can be extended)
colors = "black|white|red|green|blue|yellow|silver|grey|orange|purple|gold|bronze|brown"

# Function to extract brand, model, year, and color, with enhanced regex
def extract_info_improved(text):
    # Normalize the text by removing known noise patterns and handling edge cases
    text = str(text)
    text = re.sub(r'\s+\(.*?\)', '', text)  # Remove any content inside parentheses
    text = re.sub(r'\s-\s.*', '', text)     # Remove descriptions after a dash
    text = re.sub(r"\b(?<!\d)(?!\d{4})\d+\b", "", text)  # Remove isolated numbers that are not part of a four-digit year
    text = text.replace(',', '')  # Remove commas that might be used as separators

    # Enhanced regex pattern to handle various cases including color
    pattern = (
        r'^(?P<Brand>\D+?)'              # Capture the brand as non-digit characters at the start
        r'\s+(?P<Model>.*?)'             # Capture the model which might include numbers
        r'(\s+(?P<Year>\d{4}))?'         # Optionally capture a four-digit year (plain raw string, so single braces; '{{4}}' would not mean "four digits" here)
        rf'(\s+(?P<Color>{colors}))?$'   # Optionally capture a color at the end
    )
    match = re.match(pattern, text)
    if match:
        brand = match.group('Brand').strip()
        model = match.group('Model').strip()
        year = match.group('Year') if match.group('Year') else None
        color = match.group('Color') if 'Color' in match.groupdict() and match.group('Color') else None
        return brand, model, year, color
    else:
        return None, None, None, None

# Initialize counter for unique ID generation
counter = 1

# Process each item, reshape data
reshaped_rows = []
for col in data.columns:
    manufacturer_name = col.split('_')[0]
    for item in data[col]:
        if pd.notna(item):
            brand, model, year, color = extract_info_improved(item)
            reshaped_rows.append({
                'Model_Manufacturer_name': manufacturer_name,
                'Collectible_item_full_name': item,
                'Brand': brand, 'Model': model, 'Year': year, 'Color': color,
                'id': f"{manufacturer_name}-{item}-{counter:04d}"
            })
            counter += 1

# Convert reshaped_rows into a DataFrame
df_b = pd.DataFrame(reshaped_rows)

# Reporting missing data and total entries
total_entries = len(df_b)
missing_brand = df_b['Brand'].isnull().sum()
missing_year = df_b['Year'].isnull().sum()
missing_color = df_b['Color'].isnull().sum()

print(f"Total number of entries: {total_entries}")
print(f"Number of entries with missing 'Brand': {missing_brand}")
print(f"Number of entries with missing 'Year': {missing_year}")
print(f"Number of entries with missing 'Color': {missing_color}")

# Display the DataFrame to verify results
display(df_b.head())  # Modify this to display(df_b) if you want to see the entire DataFrame


# Merge & fill the gaps for year and color
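The merge relies on both tables carrying the same `id` for the same item. A minimal sketch on two made-up frames (`df_a_toy` and `df_b_toy`, with invented ids and values) shows the idea. The `validate="one_to_one"` argument is an optional safeguard added here, not something the notebook uses: it makes pandas raise an error if an `id` appears more than once on either side.

```python
import pandas as pd

# Made-up stand-ins: one table carries the year, the other the color
df_a_toy = pd.DataFrame({"id": ["A-0001", "A-0002"], "Year": ["1965", None]})
df_b_toy = pd.DataFrame({"id": ["A-0001", "A-0002"], "Color": [None, "red"]})

# Keep every row of the first table and attach the color by id
merged_toy = pd.merge(
    df_a_toy,
    df_b_toy[["id", "Color"]],
    on="id",
    how="left",
    validate="one_to_one",  # fail loudly if ids are not unique
)

print(merged_toy)
```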

In [None]:
import pandas as pd

# Assuming df_a and df_b have been created by running the two cells above

# Merge df_a and df_b using the 'id' field
# We take all columns from df_a and only the 'Color' column from df_b
merged_data = pd.merge(df_a, df_b[['id', 'Color']], on='id', how='left')

# Reorder the columns to place 'id' before 'Model_Manufacturer_name'
column_order = ['id', 'Model_Manufacturer_name'] + [col for col in merged_data.columns if col not in ['id', 'Model_Manufacturer_name']]
merged_data = merged_data[column_order]

# df_a has no 'Color' column, so the merge above already yields a single 'Color' column taken from df_b;
# no further combining is needed.

# Now you have a DataFrame with all information merged where the 'Year' comes from df_a and 'Color' from df_b
# Let's calculate missing data statistics to ensure everything aligns with expectations
total_entries = len(merged_data)
missing_brand = merged_data['Brand'].isnull().sum()
missing_year = merged_data['Year'].isnull().sum()
missing_color = merged_data['Color'].isnull().sum()

print(f"Total number of entries: {total_entries}")
print(f"Number of entries with missing 'Brand': {missing_brand}")
print(f"Number of entries with missing 'Year': {missing_year}")
print(f"Number of entries with missing 'Color': {missing_color}")


# Export the merged data to an Excel file
merged_data.to_excel('cleaned_data_cars_catalogue_v5_final.xlsx', index=False)

# Display the DataFrame to verify results
display(merged_data.head())  # Modify this to display(merged_data) if you want to see the entire DataFrame


In [None]:
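# Count the items per manufacturer (this table is not built in an earlier cell, so it is created here)
manufacturer_counts = merged_data['Model_Manufacturer_name'].value_counts().reset_index()
manufacturer_counts.columns = ['Model_Manufacturer_name', 'count_brand_models']

# Sort the manufacturers by count from highest to lowest
manufacturer_counts_sorted = manufacturer_counts.sort_values(by='count_brand_models', ascending=False)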

# Display the table of manufacturers and their respective item counts
print("Manufacturers and Counts (Ordered by Count, Highest to Lowest):")
display(manufacturer_counts_sorted)


In [None]:
# Filter the manufacturer counts DataFrame to include only manufacturers with names less than 3 characters long
manufacturers_short_names = manufacturer_counts_sorted[manufacturer_counts_sorted['Model_Manufacturer_name'].str.len() < 3]

# Display the list of manufacturers with names less than 3 characters long
print("Manufacturers with Names Less Than 3 Characters Long:")
display(manufacturers_short_names)


In [None]:
# Group the data by 'Brand' and count the number of occurrences of each brand
brand_counts = merged_data['Brand'].value_counts().reset_index()

# Rename the columns for clarity
brand_counts.columns = ['Brand_name', 'count_of_brand']

# Sort the DataFrame by 'count_of_brand' column in descending order
brand_counts_sorted = brand_counts.sort_values(by='count_of_brand', ascending=False)

# Display the table of brand names and their respective counts in descending order
print("Brand Names and Counts (Highest to Lowest):")
display(brand_counts_sorted)


In [None]:
import matplotlib.pyplot as plt

# Calculate the number of entries with colors and entries without colors
entries_with_color = merged_data['Color'].notnull().sum()
entries_without_color = merged_data['Color'].isnull().sum()

# Create a pie chart for color
labels_color = ['Entries with Color', 'Entries without Color']
sizes_color = [entries_with_color, entries_without_color]
colors_color = ['#ff9999', '#66b3ff']
explode_color = (0.1, 0)  # explode the 1st slice (Entries with Color)

# Calculate the number of entries with years and entries without years
entries_with_year = merged_data['Year'].notnull().sum()
entries_without_year = merged_data['Year'].isnull().sum()

# Create a pie chart for year
labels_year = ['Entries with Year', 'Entries without Year']
sizes_year = [entries_with_year, entries_without_year]
colors_year = ['#ff9999', '#66b3ff']
explode_year = (0.1, 0)  # explode the 1st slice (Entries with Year)

# Create subplots
fig, axs = plt.subplots(1, 2, figsize=(12, 6))

# Plot pie chart for color
axs[0].pie(sizes_color, explode=explode_color, labels=labels_color, colors=colors_color, autopct='%1.1f%%', shadow=True, startangle=90)
axs[0].set_title('Percentage of Entries with Color vs Entries without Color')

# Plot pie chart for year
axs[1].pie(sizes_year, explode=explode_year, labels=labels_year, colors=colors_year, autopct='%1.1f%%', shadow=True, startangle=90)
axs[1].set_title('Percentage of Entries with Year vs Entries without Year')

plt.show()


In [None]:
import pandas as pd

# Assuming 'merged_data' is your DataFrame containing the data
# Extract unique brands and sort alphabetically
unique_brands = merged_data['Brand'].dropna().unique()
unique_brands_sorted = sorted(unique_brands)

# Display the total number of unique brands
total_unique_brands = len(unique_brands_sorted)
print(f"Total number of unique brand names: {total_unique_brands}")

# Display the sorted unique brands
print("\nUnique Brands (Alphabetically sorted):")
for brand in unique_brands_sorted:
    print(brand)


In [None]:
from fuzzywuzzy import fuzz
from itertools import combinations

# List to store potential typos
potential_typos = []

# Iterate through combinations of brands
for brand1, brand2 in combinations(unique_brands_sorted, 2):
    # Compute a similarity score (0-100) for the pair; higher means more alike
    similarity_score = fuzz.ratio(brand1.lower(), brand2.lower())
    # If similarity score is above a certain threshold, consider them potential typos
    if similarity_score > 80:  # You can adjust this threshold as needed
        potential_typos.append((brand1, brand2, similarity_score))

# Print potential typos
print("Potential Typos:")
for typo_pair in potential_typos:
    print(f"{typo_pair[0]} - {typo_pair[1]} (Similarity Score: {typo_pair[2]})")
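
The pairwise comparison above can also be sketched with only the standard library, which makes the role of the threshold easier to see. This is a minimal sketch that swaps in `difflib.SequenceMatcher` for `fuzz.ratio`, so scores may differ slightly from the ones printed by the cell above. The helper names `similarity` and `find_similar_pairs` and the sample brand list are hypothetical and are not taken from the dataset.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> int:
    """Return a 0-100 case-insensitive similarity score for two strings."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def find_similar_pairs(names, threshold=80):
    """Return (name1, name2, score) for every pair scoring above the threshold."""
    pairs = []
    for name1, name2 in combinations(sorted(names), 2):
        score = similarity(name1, name2)
        if score > threshold:
            pairs.append((name1, name2, score))
    return pairs


# Hypothetical sample values, not taken from the dataset
sample_brands = ["Plimouth", "Plymouth", "Ferrari", "Fiat"]
similar_pairs = find_similar_pairs(sample_brands)
for name1, name2, score in similar_pairs:
    print(f"{name1} - {name2} (Similarity Score: {score})")
```

Lowering the threshold surfaces more candidate pairs but also more false positives, so the flagged pairs are best treated as a list to review by hand rather than as confirmed typos.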


In [None]:
import pandas as pd

# Define the correction dictionary with wrong and correct spellings
corrections = {
    "ALFA": "Alfa",
    "ASTON": "Aston",
    "AWZ": "AWZ",
    "Autobianch": "Autobianchi",
    "BUGATTI": "Bugatti",
    "Betliet": "Berliet",
    "Brbham": "Brabham",
    "CHEVROLET": "Chevrolet",
    "CHRYSLER": "Chrysler",
    "CITROËN": "Citroën",
    "Cadillac": "Cadillac",
    "Citroen": "Citroën",
    "Duesemberg": "Duesenberg",
    "FERRARI": "Ferrari",
    "FIAT": "Fiat",
    "FORD": "Ford",
    "GAZ": "GAZ",
    "HUMMER": "Hummer",
    "ISO": "ISO",
    "ISUZU": "Isuzu",
    "Iveco": "IVECO",
    "LAMBORGHINI": "Lamborghini",
    "LOLA": "Lola",
    "MERCEDES-BENZ": "Mercedes-Benz",
    "Mercede-Benz": "Mercedes-Benz",
    "MINI": "Mini",
    "Maserati": "Maserati",
    "McLaren": "McLaren",
    "Moskvitch": "Москвич",
    "Moskwitch": "Москвич",
    "Oldsmobil": "Oldsmobile",
    "PANHARD": "Panhard",
    "PORSCHE": "Porsche",
    "Plimouth": "Plymouth",
    "RENAULT": "Renault",
    "Red": "Red",
    "Saab": "SAAB",
    "SAVA": "Sava",
    "SAVIEM": "Saviem",
    "SEAT": "Seat",
    "SIAM": "Siam",
    "SIMCA": "Simca",
    "ЗИС": "ЗИС"
}

# Apply corrections to the 'Brand' column, which is where the misspellings above were collected.
# Series.replace with a dict matches whole cell values only.
merged_data['Brand'] = merged_data['Brand'].replace(corrections)

# Reorder the columns to place 'id' before 'Model_Manufacturer_name'
column_order = ['id', 'Model_Manufacturer_name'] + [col for col in merged_data.columns if col not in ['id', 'Model_Manufacturer_name']]
merged_data = merged_data[column_order]

# Now you have a DataFrame with corrected brand names
# Let's calculate missing data statistics to ensure everything aligns with expectations
total_entries = len(merged_data)
missing_brand = merged_data['Brand'].isnull().sum()
missing_year = merged_data['Year'].isnull().sum()
missing_color = merged_data['Color'].isnull().sum()

print(f"Total number of entries: {total_entries}")
print(f"Number of entries with missing 'Brand', excluding missing 'Model_Manufacturer_name': {missing_brand}")
print(f"Number of entries with missing 'Year', excluding missing 'Model_Manufacturer_name': {missing_year}")
print(f"Number of entries with missing 'Color', excluding missing 'Model_Manufacturer_name': {missing_color}")

# Export the merged data to an Excel file
merged_data.to_excel('cleaned_data_cars_catalogue_v5.1_final.xlsx', index=False)

# Display the DataFrame to verify results
display(merged_data.head())  # Modify this to display(merged_data) if you want to see the entire DataFrame
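
A hand-written correction mapping like the one above tends to collect entries that map a value to itself, as well as keys that never occur in the data. The following is a minimal sketch of a sanity check for such a mapping. The function name `audit_corrections` and the sample values are hypothetical; only the idea of a wrong-to-correct dictionary comes from the notebook.

```python
def audit_corrections(corrections, observed_values):
    """Split a correction mapping into no-op, unused, and effective entries."""
    observed = set(observed_values)
    # Entries whose key and value are identical change nothing
    no_ops = sorted(k for k, v in corrections.items() if k == v)
    # Keys that never occur in the data cannot match anything
    unused = sorted(k for k in corrections if k not in observed)
    # Entries that will actually rewrite at least one observed value
    effective = {k: v for k, v in corrections.items() if k != v and k in observed}
    return no_ops, unused, effective


# Hypothetical subset of the mapping and of the observed column values
sample_corrections = {"FIAT": "Fiat", "GAZ": "GAZ", "Plimouth": "Plymouth"}
sample_observed = ["FIAT", "GAZ", "Fiat", "Ferrari"]

no_ops, unused, effective = audit_corrections(sample_corrections, sample_observed)
print("No-op entries:", no_ops)
print("Keys not found in the data:", unused)
print("Entries that change values:", effective)
```

In this notebook the check could be run as `audit_corrections(corrections, merged_data['Brand'].dropna().unique())`. A long list of unused keys usually means the mapping is being compared against the wrong column.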


In [None]:
import pandas as pd
import uuid  # Importing the UUID module to generate unique IDs

# Assumes df_a and df_b were defined and loaded in earlier cells

# Merge df_a and df_b using the 'id' field
# We take all columns from df_a and only the 'Color' column from df_b
# If df_a already has a 'Color' column, the one from df_b arrives as 'Color_b'
merged_data = pd.merge(df_a, df_b[['id', 'Color']], on='id', how='left', suffixes=('', '_b'))

# Reorder the columns to place 'id' before 'Model_Manufacturer_name'
column_order = ['id', 'Model_Manufacturer_name'] + [col for col in merged_data.columns if col not in ['id', 'Model_Manufacturer_name']]
merged_data = merged_data[column_order]

# 'Year' is already filled in df_a as needed, so only 'Color' has to be added
# Where both sources have a color, the value from df_b takes precedence
if 'Color_b' in merged_data.columns:
    merged_data['Color'] = merged_data['Color_b'].combine_first(merged_data['Color'])
    merged_data = merged_data.drop(columns='Color_b')

# Now you have a DataFrame with all information merged where the 'Year' comes from df_a and 'Color' from df_b

# Generate unique IDs using UUID
total_entries = len(merged_data)
merged_data['Unique_ID'] = [str(uuid.uuid4()) for _ in range(total_entries)]

# Let's calculate missing data statistics to ensure everything aligns with expectations
missing_brand = merged_data['Brand'].isnull().sum()
missing_year = merged_data['Year'].isnull().sum()
missing_color = merged_data['Color'].isnull().sum()

print(f"Total number of entries: {total_entries}")
print(f"Number of entries with missing 'Brand', excluding missing 'Model_Manufacturer_name': {missing_brand}")
print(f"Number of entries with missing 'Year', excluding missing 'Model_Manufacturer_name': {missing_year}")
print(f"Number of entries with missing 'Color', excluding missing 'Model_Manufacturer_name': {missing_color}")

# Export the merged data to an Excel file
merged_data.to_excel('cleaned_data_cars_catalogue_v5_final_with_ids.xlsx', index=False)

# Display the DataFrame to verify results
display(merged_data.head())  # Modify this to display(merged_data) if you want to see the entire DataFrame
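
Because `uuid.uuid4()` is random, re-running the cell above assigns a different `Unique_ID` to every row each time. If the inventory system needs identifiers that survive a re-run, one option is to derive them from an existing key with `uuid.uuid5`. This is a minimal sketch, assuming the `id` column is unique and does not change between runs. The namespace string and the helper name `stable_id` are hypothetical.

```python
import uuid

# Hypothetical namespace for this catalogue; any fixed UUID works as long as it never changes
CATALOGUE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "cars-catalogue.example")


def stable_id(record_id) -> str:
    """Derive a UUID that is always the same for the same record id."""
    return str(uuid.uuid5(CATALOGUE_NAMESPACE, str(record_id)))


# Simulate two separate runs over the same record ids
first_run = [stable_id(i) for i in [1, 2, 3]]
second_run = [stable_id(i) for i in [1, 2, 3]]

print(first_run == second_run)  # True: the same input gives the same UUID
print(len(set(first_run)) == len(first_run))  # True: distinct ids give distinct UUIDs
```

With the DataFrame from this notebook, the same idea would read `merged_data['Unique_ID'] = merged_data['id'].map(stable_id)`. The trade-off is that the identifier is only as stable as the `id` column it is derived from.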
