<H1 align="center">  Google Data Analytics Certificate </H1>

---

# Case Study 2: How Can a Wellness Technology Company Play It Smart?

In this scenario, we are working on the marketing analyst team at [Bellabeat](https://bellabeat.com/), a high-tech manufacturer of health-focused products for women. Urška Sršen, cofounder and Chief Creative Officer of **Bellabeat**, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.

The dataset was obtained from Kaggle at https://www.kaggle.com/datasets/shreyaspj/android-devices-and-mobiles/data. According to the author, the dataset was scraped from the e-commerce website Flipkart. It consists of the **market performance metrics** (price, rating, and reviews) and various **device specifications** (battery, camera, display, etc.) of 984 Android devices sold on the website.

This notebook analyses the association between each of the device specifications and performance metrics. A summary of the main findings is provided.

In [184]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
import re

In [185]:
df = pd.read_csv("mobiles1.csv")
df.head()

Unnamed: 0,battery,camera,display,memory,name,price,processor,rating,reviews,warranty
0,5000 mAh Battery,12MP + 2MP | 8MP Front Camera,15.8 cm (6.22 inch) HD+ Display,4 GB RAM | 64 GB ROM | Expandable Upto 512 GB,"Redmi 8 (Ruby Red, 64 GB)",9999,Qualcomm Snapdragon 439 Processor,4.4,"55,078 Reviews",Brand Warranty of 1 Year Available for Mobile ...
1,5000 mAh Battery,12MP + 8MP + 2MP + 2MP | 8MP Front Camera,16.56 cm (6.52 inch) HD+ Display,4 GB RAM | 64 GB ROM,"Realme 5i (Aqua Blue, 64 GB)",10999,Qualcomm Snapdragon 665 2 GHz Processor,4.5,"20,062 Reviews",Sunrise Design
2,5000 mAh Battery,12MP + 8MP + 2MP + 2MP | 8MP Front Camera,16.56 cm (6.52 inch) HD+ Display,4 GB RAM | 128 GB ROM,"Realme 5i (Aqua Blue, 128 GB)",11999,Qualcomm Snapdragon 665 (2 GHz) Processor,4.5,"20,062 Reviews",Sunrise Design
3,5000 mAh Battery,12MP + 8MP + 2MP + 2MP | 8MP Front Camera,16.56 cm (6.52 inch) HD+ Display,4 GB RAM | 128 GB ROM,"Realme 5i (Forest Green, 128 GB)",11999,Qualcomm Snapdragon 665 (2 GHz) Processor,4.5,"20,062 Reviews",Sunrise Design
4,4000 mAh Battery,13MP + 2MP | 5MP Front Camera,15.49 cm (6.1 inch) HD+ Display,3 GB RAM | 32 GB ROM | Expandable Upto 256 GB,"Realme C2 (Diamond Blue, 32 GB)",7499,MediaTek P22 Octa Core 2.0 GHz Processor,4.4,"10,091 Reviews",Dual Nano SIM slots and Memory Card Slot


In [186]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 984 entries, 0 to 983
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   battery    984 non-null    object 
 1   camera     984 non-null    object 
 2   display    984 non-null    object 
 3   memory     984 non-null    object 
 4   name       984 non-null    object 
 5   price      984 non-null    int64  
 6   processor  983 non-null    object 
 7   rating     971 non-null    float64
 8   reviews    971 non-null    object 
 9   warranty   836 non-null    object 
dtypes: float64(1), int64(1), object(8)
memory usage: 77.0+ KB



## **🧹 1 Data Wrangling** 

### **1.1 Removing missing values and anomalies**

First, we check each column for missing values. We then remove every row that contains a missing value, and recount both the missing values and the remaining rows to confirm the result.

In [187]:
# obtain the number of missing data points per column
missing_values_count = df.isnull().sum()
missing_values_count[:]

battery        0
camera         0
display        0
memory         0
name           0
price          0
processor      1
rating        13
reviews       13
warranty     148
dtype: int64

In [188]:
df = df.dropna()

# Recalculate missing values count
missing_values_count = df.isnull().sum()
print(missing_values_count)

# Double check number of rows remaining 
print()
print(df.shape[0])

battery      0
camera       0
display      0
memory       0
name         0
price        0
processor    0
rating       0
reviews      0
warranty     0
dtype: int64

828


The warranty column contains many anomalies (see the first 15 rows below). Looking up the devices on Flipkart suggests the author made an error during web scraping, as many of the odd values in the column appear to belong to the **highlights** section of the device pages instead.

For one of many examples, see the **highlights** section for the device *realme C2 (Diamond Blue, 32 GB)* on its Flipkart store page: https://shorturl.at/1TMgr. One of the device's highlights is *Dual Nano SIM slots and Memory Card Slot*, which is the value recorded for it in the warranty column.

Unfortunately, there are far too many such errors to correct, so the warranty column will be dropped.

In [189]:
# First 15 rows of warranty
df["warranty"].head(15)

0     Brand Warranty of 1 Year Available for Mobile ...
1                                        Sunrise Design
2                                        Sunrise Design
3                                        Sunrise Design
4              Dual Nano SIM slots and Memory Card Slot
5              Dual Nano SIM slots and Memory Card Slot
6              Dual Nano SIM slots and Memory Card Slot
7              Dual Nano SIM slots and Memory Card Slot
8                                        Sunrise Design
9              Dual Nano SIM slots and Memory Card Slot
10    Brand Warranty of 1 Year Available for Mobile ...
11    Brand Warranty of 1 Year Available for Mobile ...
12                                   18 W Fast Charging
13    Brand Warranty of 1 Year Available for Mobile ...
14    Brand Warranty of 1 Year Available for Mobile ...
Name: warranty, dtype: object

In [190]:
# Drop warranty column
df = df.drop(columns=["warranty"])

# Display first 5 rows to confirm column has been removed
df.head()

Unnamed: 0,battery,camera,display,memory,name,price,processor,rating,reviews
0,5000 mAh Battery,12MP + 2MP | 8MP Front Camera,15.8 cm (6.22 inch) HD+ Display,4 GB RAM | 64 GB ROM | Expandable Upto 512 GB,"Redmi 8 (Ruby Red, 64 GB)",9999,Qualcomm Snapdragon 439 Processor,4.4,"55,078 Reviews"
1,5000 mAh Battery,12MP + 8MP + 2MP + 2MP | 8MP Front Camera,16.56 cm (6.52 inch) HD+ Display,4 GB RAM | 64 GB ROM,"Realme 5i (Aqua Blue, 64 GB)",10999,Qualcomm Snapdragon 665 2 GHz Processor,4.5,"20,062 Reviews"
2,5000 mAh Battery,12MP + 8MP + 2MP + 2MP | 8MP Front Camera,16.56 cm (6.52 inch) HD+ Display,4 GB RAM | 128 GB ROM,"Realme 5i (Aqua Blue, 128 GB)",11999,Qualcomm Snapdragon 665 (2 GHz) Processor,4.5,"20,062 Reviews"
3,5000 mAh Battery,12MP + 8MP + 2MP + 2MP | 8MP Front Camera,16.56 cm (6.52 inch) HD+ Display,4 GB RAM | 128 GB ROM,"Realme 5i (Forest Green, 128 GB)",11999,Qualcomm Snapdragon 665 (2 GHz) Processor,4.5,"20,062 Reviews"
4,4000 mAh Battery,13MP + 2MP | 5MP Front Camera,15.49 cm (6.1 inch) HD+ Display,3 GB RAM | 32 GB ROM | Expandable Upto 256 GB,"Realme C2 (Diamond Blue, 32 GB)",7499,MediaTek P22 Octa Core 2.0 GHz Processor,4.4,"10,091 Reviews"
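
Rows with a missing warranty value were removed by `dropna()` before the warranty column itself was dropped, so the order of the two steps affects how many rows survive. The sketch below illustrates this on a toy frame. The values are hypothetical and are not taken from the Flipkart data, and the alternative ordering is only shown for comparison; it is not what this notebook does.

```python
import pandas as pd

# Toy frame with hypothetical values (not the Flipkart data)
toy = pd.DataFrame({
    "rating": [4.4, None, 4.5, 4.1],
    "reviews": ["10 Reviews", None, "20 Reviews", "5 Reviews"],
    "warranty": ["1 Year", "Sunrise Design", None, None],
})

# Order used in this notebook: drop incomplete rows, then drop the column
rows_then_column = toy.dropna().drop(columns=["warranty"])

# Alternative order: drop the column first, then drop incomplete rows
column_then_rows = toy.drop(columns=["warranty"]).dropna()

print(len(rows_then_column))  # 1
print(len(column_then_rows))  # 3
```

With the alternative order, rows whose only gap is the warranty value are kept. Whether that is preferable depends on how much those rows matter for the later analysis.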


### **1.2 Separating multi-attribute columns and grouping similar categories**
Many columns contain multiple pieces of information. For example, the name column includes the **model** (e.g., Redmi 8), the **colour** (e.g., Ruby Red) and the **ROM size** (e.g., 64 GB). Such attributes can be systematically separated into distinct columns for better clarity and analysis.

#### Battery

The battery type labels are inconsistent, with variations like Li-Ion and Li-Ion Polymer likely referring to the same technology within the dataset. While these are technically distinct technologies, most modern smartphones use Li-Ion Polymer, making the distinction less meaningful in this context. Additionally, it is reasonable to assume that battery capacity is the primary factor consumers consider, whereas battery type has minimal impact on purchasing decisions. Therefore, **only battery capacity is retained for analysis**, and the ambiguous battery type is dropped.

In [191]:
# Function to extract battery capacity (in mAh)
def extract_capacity(value):
    match = re.search(r"(\d+)\s*mAh", str(value))
    return int(match.group(1)) if match else None

# Create new column for battery capacity
df["Battery_Capacity (mAh)"] = df["battery"].apply(extract_capacity)
df["Battery_Capacity (mAh)"] = df["Battery_Capacity (mAh)"].astype("Int64")

# Drop the original battery column
df = df.drop(columns=["battery"])

df[["Battery_Capacity (mAh)"]].head(10)

Unnamed: 0,Battery_Capacity (mAh)
0,5000
1,5000
2,5000
3,5000
4,4000
5,4000
6,4000
7,4000
8,5000
9,4000
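
The following is a minimal sketch of how the capacity pattern behaves on individual strings, and of why the column is cast to the nullable `Int64` dtype. Only the first sample string comes from the dataset preview; the other two are hypothetical inputs chosen for illustration.

```python
import re
import pandas as pd

def extract_capacity(value):
    """Return the number preceding 'mAh', or None if the pattern is absent."""
    match = re.search(r"(\d+)\s*mAh", str(value))
    return int(match.group(1)) if match else None

samples = [
    "5000 mAh Battery",                 # from the dataset preview
    "4000 mAh Li-Ion Polymer Battery",  # hypothetical
    "Battery",                          # hypothetical, no capacity present
]
capacities = [extract_capacity(s) for s in samples]
print(capacities)  # [5000, 4000, None]

# A plain integer dtype cannot hold a missing value; the nullable "Int64"
# dtype stores it as <NA> while the other entries stay integers.
series = pd.Series(capacities).astype("Int64")
print(series.dtype)  # Int64
```

Any battery string without a capacity would therefore appear as `<NA>` rather than raising an error or forcing the column to floats.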


#### Camera

The camera column contains 4 pieces of information: the number of lenses in the back camera, the megapixel count of each back lens, the number of lenses in the front camera, and the megapixel count of each front lens. For ease of analysis, only the highest megapixel count and the number of lenses are extracted for each of the front and back cameras.

**Note:** Certain inputs, such as "*48MP + 8MP | 48MP(F2.0) + 8MP(Ultra Wide/F2.2) + TOF (Time-of-Flight) 3D-Depth Rotating Camera*", include additional description beyond the number of lenses and megapixels. The approach taken is to ignore these extra descriptions, since there are only a few of them.

In [192]:
def extract_camera_info(camera_str):
    # Split into back and front camera using '|'
    parts = camera_str.split('|')
    
    back_camera = parts[0].strip()  # First part is back camera
    front_camera = parts[1].strip() if len(parts) > 1 else ""  # Second part is front camera if available

    # Function to extract highest megapixel count and number of lenses
    def get_camera_details(camera_text):
        # Find all occurrences of megapixels (e.g., 48MP, 8MP)
        megapixels = [int(mp) for mp in re.findall(r'(\d+)MP', camera_text)]
        highest_mp = max(megapixels) if megapixels else None  # Get the highest MP
        
        # Count the number of lenses (number of plus signs, plus one)
        num_lenses = camera_text.count('+') + 1
        
        return highest_mp, num_lenses

    # Extract details
    back_mp, back_lenses = get_camera_details(back_camera)
    front_mp, front_lenses = get_camera_details(front_camera)

    return back_mp, back_lenses, front_mp, front_lenses

df[['Back_Camera_MP', 'Back_Camera_No._Lenses', 'Front_Camera_MP','Front_Camera_No._Lenses']] = df['camera'].apply(
    lambda x: pd.Series(extract_camera_info(str(x)))
)
df["Back_Camera_MP"] = df["Back_Camera_MP"].astype("Int64")
df["Back_Camera_No._Lenses"] = df["Back_Camera_No._Lenses"].astype("Int64")
df["Front_Camera_MP"] = df["Front_Camera_MP"].astype("Int64")
df["Front_Camera_No._Lenses"] = df["Front_Camera_No._Lenses"].astype("Int64")

df = df.drop(columns=["camera"])

df[['Back_Camera_MP', 'Back_Camera_No._Lenses', 'Front_Camera_MP','Front_Camera_No._Lenses']].head()

Unnamed: 0,Back_Camera_MP,Back_Camera_No._Lenses,Front_Camera_MP,Front_Camera_No._Lenses
0,12,2,8,1
1,12,4,8,1
2,12,4,8,1
3,12,4,8,1
4,13,2,5,1
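
To see what ignoring the extra descriptions means in practice, the sketch below applies the same two rules (highest `NNMP` value, and plus signs plus one) to the annotated input quoted in the note. `summarize_camera` is a hypothetical helper written for this illustration; it mirrors the logic of `get_camera_details` but is not the notebook's function.

```python
import re

def summarize_camera(camera_text):
    """Return (highest MP, lens count) for one side of the camera string."""
    megapixels = [int(mp) for mp in re.findall(r"(\d+)MP", camera_text)]
    highest_mp = max(megapixels) if megapixels else None
    num_lenses = camera_text.count("+") + 1
    return highest_mp, num_lenses

raw = ("48MP + 8MP | 48MP(F2.0) + 8MP(Ultra Wide/F2.2) + "
       "TOF (Time-of-Flight) 3D-Depth Rotating Camera")

# The part before '|' is treated as the back camera, the part after as the front
back, front = [part.strip() for part in raw.split("|")]

print(summarize_camera(back))   # (48, 2)
print(summarize_camera(front))  # (48, 3)
```

The aperture and lens-type annotations do not affect the megapixel result, but the TOF sensor is counted as a third lens because the count relies on plus signs alone.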


#### Display

The display column contains 2 pieces of information: display size (in both cm and inches) and display resolution. Once extracted, display size is kept in inches only.
Display resolution consists of 4 main types: HD, HD+, Full HD, and Full HD+. Any other resolution is classified as "Unspecified".

**Note:** As with the camera column, certain inputs include additional descriptions beyond the display resolution. These are classified as "Unspecified".

In [193]:
def extract_display_info(display_str):
    """Extract display size (in inches) and resolution type (HD+, HD, Full HD+, Full HD, Unspecified)."""
    display_size = None
    resolution = "Unspecified"  # Default value

    # Extract display size (e.g., 6.5 inch, 6.7", etc.)
    size_match = re.search(r'(\d+(\.\d+)?)\s*(inch|")', display_str, re.IGNORECASE)
    if size_match:
        display_size = float(size_match.group(1))  # Convert to float

    # Extract resolution type
    if re.search(r'full\s*hd\+', display_str, re.IGNORECASE):
        resolution = "Full HD+"
    elif re.search(r'full\s*hd', display_str, re.IGNORECASE):
        resolution = "Full HD"
    elif re.search(r'hd\+', display_str, re.IGNORECASE):
        resolution = "HD+"
    elif re.search(r'\bhd\b', display_str, re.IGNORECASE):  # Ensure "HD" is a separate word
        resolution = "HD"

    return display_size, resolution

# Apply function to extract display details
df[['Display_Size_(inches)', 'Resolution']] = df['display'].apply(
    lambda x: pd.Series(extract_display_info(str(x)))
)

df = df.drop(columns=["display"])

df[['Display_Size_(inches)', 'Resolution']].head()

Unnamed: 0,Display_Size_(inches),Resolution
0,6.22,HD+
1,6.52,HD+
2,6.52,HD+
3,6.52,HD+
4,6.1,HD+
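
The resolution patterns are tested from most to least specific, because a string such as "Full HD+" also contains "HD+" and "HD". A minimal sketch of that ordering follows. `classify_resolution` and `RESOLUTION_PATTERNS` are hypothetical names used only for this illustration, and all sample strings except the first are made up.

```python
import re

# Patterns are tried in order, most specific first (same order as the notebook)
RESOLUTION_PATTERNS = [
    (r"full\s*hd\+", "Full HD+"),
    (r"full\s*hd", "Full HD"),
    (r"hd\+", "HD+"),
    (r"\bhd\b", "HD"),
]

def classify_resolution(display_str):
    """Return the first matching resolution label, else 'Unspecified'."""
    for pattern, label in RESOLUTION_PATTERNS:
        if re.search(pattern, display_str, re.IGNORECASE):
            return label
    return "Unspecified"

samples = [
    "15.8 cm (6.22 inch) HD+ Display",      # from the dataset preview
    "16.0 cm (6.3 inch) Full HD+ Display",  # hypothetical
    "15.5 cm (6.1 inch) AMOLED Display",    # hypothetical, no HD label
    "16.0 cm (6.3 inch) Quad HD+ Display",  # hypothetical, contains "HD+"
]
for text in samples:
    print(f"{text!r} -> {classify_resolution(text)}")
```

Because the patterns match substrings, a label like the hypothetical "Quad HD+" would be assigned to "HD+" rather than "Unspecified". Whether such labels occur in this dataset was not checked here, so inspecting the value counts of the original column would be a sensible safeguard.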


#### Memory

The memory column consists of 3 pieces of information: RAM capacity, ROM capacity, and the maximum expandable storage capacity. For ease of analysis, expandable storage is reduced to a boolean variable indicating whether expansion is possible; the size of the expansion is not considered.

In [194]:
# Function to extract RAM, ROM, and Expandable Storage (Boolean)
def extract_memory_features(memory_str):
    ram_match = re.search(r'(\d+)\s*GB\s*RAM', memory_str, re.IGNORECASE)
    rom_match = re.search(r'(\d+)\s*GB\s*ROM', memory_str, re.IGNORECASE)
    expandable_match = re.search(r'Expandable', memory_str, re.IGNORECASE)

    ram = int(ram_match.group(1)) if ram_match else None
    rom = int(rom_match.group(1)) if rom_match else None
    expandable = bool(expandable_match)  # Converts to True if found, otherwise False

    return pd.Series([ram, rom, expandable])

# Apply function to extract values
df[['RAM_Capacity_(GB)', 'ROM_Capacity_(GB)', 'ROM_Expandable']] = df['memory'].apply(extract_memory_features)

df["RAM_Capacity_(GB)"] = df["RAM_Capacity_(GB)"].astype("Int64")
df["ROM_Capacity_(GB)"] = df["ROM_Capacity_(GB)"].astype("Int64")
df["ROM_Expandable"] = df["ROM_Expandable"].astype(bool)  

# Drop the original memory column
df = df.drop(columns=['memory'])

df[['RAM_Capacity_(GB)', 'ROM_Capacity_(GB)', 'ROM_Expandable']].head()

Unnamed: 0,RAM_Capacity_(GB),ROM_Capacity_(GB),ROM_Expandable
0,4,64,True
1,4,64,False
2,4,128,False
3,4,128,False
4,3,32,True
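
As a quick check of the memory patterns, the sketch below runs them on two strings from the dataset preview and on one hypothetical string that gives RAM in MB. `parse_memory` is a hypothetical helper for this illustration that returns a plain tuple instead of a `pd.Series`.

```python
import re

def parse_memory(memory_str):
    """Return (RAM in GB, ROM in GB, expandable flag) from a memory string."""
    ram = re.search(r"(\d+)\s*GB\s*RAM", memory_str, re.IGNORECASE)
    rom = re.search(r"(\d+)\s*GB\s*ROM", memory_str, re.IGNORECASE)
    expandable = re.search(r"Expandable", memory_str, re.IGNORECASE)
    return (
        int(ram.group(1)) if ram else None,
        int(rom.group(1)) if rom else None,
        bool(expandable),
    )

# Strings taken from the dataset preview
print(parse_memory("4 GB RAM | 64 GB ROM | Expandable Upto 512 GB"))  # (4, 64, True)
print(parse_memory("4 GB RAM | 64 GB ROM"))                           # (4, 64, False)

# Hypothetical string: RAM given in MB is not matched by the GB pattern
print(parse_memory("512 MB RAM | 4 GB ROM"))                          # (None, 4, False)
```

The "512 GB" in the expandable clause is not mistaken for ROM because the pattern requires the number to be followed by "GB ROM". Capacities expressed in MB, if any exist in the data, would come out as missing.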


#### Name

The name column 