# Laptop Pricing Dataset – Data Import & Initial Exploration
**Goal:** Import the Laptop Pricing Dataset, explore its structure, and prepare it for the cleaning and analysis phase.


**Deliverables:** An imported dataset ready for deeper cleaning and analysis.


### Environment Setup & Data Import
#### Load essential Python libraries for data handling, visualization, and numerical operations. Import the dataset for analysis.

In [17]:
# Import core data science/analysis libraries
import pandas as pd                  # For data manipulation and analysis
import numpy as np                   # For numerical operations
import matplotlib.pyplot as plt      # For plotting/visualizations
import seaborn as sns                # For advanced and prettier visualizations

# Display settings for pandas (make outputs easier to read)
pd.set_option('display.max_columns', 100)   # Show up to 100 columns when printing a DataFrame
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')  # Format floats with 2 decimals and thousands separators

# Define the dataset URL (hosted on IBM Cloud)
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/laptop_pricing_dataset_base.csv"

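# Load the dataset directly from the URL into a pandas DataFrame
# Note: the file has no header row, so with the default settings pandas
# uses the first record as column labels (visible in the preview below)
df = pd.read_csv(url)

# Save a local copy; the promoted labels become the first line of laptops.csv
df.to_csv("laptops.csv", index=False)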

# Display the first 5 rows to preview the dataset structure
df.head()

   Acer  4  IPS Panel  2  1  5   35.56  1.6  8  256  1.6.1   978
0  Dell  3    Full HD  1  1  3  39.624  2.0  4  256    2.2   634
1  Dell  3    Full HD  1  1  7  39.624  2.7  8  256    2.2   946
2  Dell  4  IPS Panel  2  1  5  33.782  1.6  8  128   1.22  1244
3    HP  4    Full HD  2  1  7  39.624  1.8  8  256   1.91   837
4  Dell  3    Full HD  1  1  5  39.624  1.6  8  256    2.2  1016


### Custom Header Assignment
#### Reload the dataset with manually assigned headers (e.g., Manufacturer, Category, Screen, OS, CPU_core, RAM_GB) so that every column has a meaningful name before analysis.

In [21]:
import pandas as pd

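# Load the local copy without treating any line as a header
# 'header=None' tells pandas: "this CSV does not have a header row,
# so treat the first row as data instead of column names"
# The first line of laptops.csv is the record that the earlier read used as
# labels, so it returns as row 0; its weight shows as "1.6.1", most likely
# because pandas renamed the repeated label "1.6"
df = pd.read_csv("laptops.csv", header=None)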

# Manually assign column names to the DataFrame
# These names describe the features of each laptop in the dataset
df.columns = [
    "Manufacturer",       # Brand or company (e.g., Dell, HP)
    "Category",           # Type of laptop, stored as an integer code
    "Screen",             # Screen technology (e.g., Full HD, IPS Panel)
    "GPU",                # Graphics card, stored as an integer code
    "OS",                 # Operating system, stored as an integer code
    "CPU_core",           # Number of CPU cores
    "Screen_Size_inch",   # Display size (values such as 35.56 and 39.624 appear to be in centimetres)
    "CPU_frequency",      # Processor speed
    "RAM_GB",             # RAM capacity in gigabytes
    "Storage_GB_SSD",     # SSD storage capacity in gigabytes
    "Weight_kg",          # Laptop weight in kilograms
    "Price"               # Laptop price
]

# Display the first 10 rows to confirm the headers are correctly applied
print(df.head(10))


  Manufacturer  Category     Screen  GPU  OS  CPU_core Screen_Size_inch  \
0         Acer         4  IPS Panel    2   1         5            35.56   
1         Dell         3    Full HD    1   1         3           39.624   
2         Dell         3    Full HD    1   1         7           39.624   
3         Dell         4  IPS Panel    2   1         5           33.782   
4           HP         4    Full HD    2   1         7           39.624   
5         Dell         3    Full HD    1   1         5           39.624   
6           HP         3    Full HD    3   1         5           39.624   
7         Acer         3  IPS Panel    2   1         5             38.1   
8         Dell         3    Full HD    1   1         5           39.624   
9         Acer         3  IPS Panel    3   1         7             38.1   

   CPU_frequency  RAM_GB  Storage_GB_SSD Weight_kg  Price  
0           1.60       8             256     1.6.1    978  
1           2.00       4             256       2.2    
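
The `1.6.1` in row 0 of `Weight_kg` is most likely a side effect of the two-step load: the first `read_csv` call used the first record as column labels, and pandas renames a repeated label (here `1.6`, which would occur for both CPU frequency and weight) to `1.6.1`. Below is a minimal sketch of reading a headerless file in a single step. The inline three-record sample stands in for the real file, the first record's weight of `1.6` is inferred rather than confirmed, and the `names=` approach is an assumption, not what this notebook does.

```python
import io
import pandas as pd

# Three records taken from the preview; the weight of the first record
# is ASSUMED to be 1.6 (inferred from the "1.6.1" label).
sample_csv = (
    "Acer,4,IPS Panel,2,1,5,35.56,1.6,8,256,1.6,978\n"
    "Dell,3,Full HD,1,1,3,39.624,2.0,4,256,2.2,634\n"
    "Dell,3,Full HD,1,1,7,39.624,2.7,8,256,2.2,946\n"
)

headers = [
    "Manufacturer", "Category", "Screen", "GPU", "OS", "CPU_core",
    "Screen_Size_inch", "CPU_frequency", "RAM_GB", "Storage_GB_SSD",
    "Weight_kg", "Price",
]

# Default behaviour: the first record is consumed as the header row,
# and the duplicated label "1.6" is renamed to "1.6.1"
df_default = pd.read_csv(io.StringIO(sample_csv))

# header=None keeps every record as data; names= assigns the labels
df_named = pd.read_csv(io.StringIO(sample_csv), header=None, names=headers)

print(list(df_default.columns))
print(df_named.shape)
print(df_named.loc[0, "Weight_kg"])
```

Reading this way keeps the first record as ordinary data, so its weight would stay numeric instead of turning into a mangled label.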

### Missing Value Handling
#### Replace placeholder symbols (?) with proper NaN values to ensure accurate handling of missing data during analysis.

In [22]:
# Import NumPy (needed for np.nan)
import numpy as np

# Replace all occurrences of '?' with NaN (Not a Number)
# inplace=True means the changes are applied directly to df (no need to reassign)
df.replace('?', np.nan, inplace=True)

# Display the first 10 rows as a quick visual check after the replacement
print(df.head(10))


  Manufacturer  Category     Screen  GPU  OS  CPU_core Screen_Size_inch  \
0         Acer         4  IPS Panel    2   1         5            35.56   
1         Dell         3    Full HD    1   1         3           39.624   
2         Dell         3    Full HD    1   1         7           39.624   
3         Dell         4  IPS Panel    2   1         5           33.782   
4           HP         4    Full HD    2   1         7           39.624   
5         Dell         3    Full HD    1   1         5           39.624   
6           HP         3    Full HD    3   1         5           39.624   
7         Acer         3  IPS Panel    2   1         5             38.1   
8         Dell         3    Full HD    1   1         5           39.624   
9         Acer         3  IPS Panel    3   1         7             38.1   

   CPU_frequency  RAM_GB  Storage_GB_SSD Weight_kg  Price  
0           1.60       8             256     1.6.1    978  
1           2.00       4             256       2.2    
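
The preview rows show no `?` in the visible columns, so `head(10)` alone cannot confirm that the replacement did anything. A minimal sketch of a per-column missing-value count follows; the `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with '?' placeholders in two columns
toy = pd.DataFrame({
    "Screen_Size_inch": ["35.56", "?", "39.624", "?"],
    "Weight_kg": ["2.2", "1.22", "?", "1.91"],
    "Price": [978, 634, 946, 1244],
})

# Same replacement as above, written without inplace=True
cleaned = toy.replace("?", np.nan)

# Number of missing values in each column
missing_per_column = cleaned.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = cleaned[cleaned.isna().any(axis=1)]
print(rows_with_missing)
```

Applied to the real `df`, `df.isna().sum()` would give the same kind of per-column count.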

### Data Type Inspection
#### Examine and print the data type of each column to understand the structure of the data and identify any type conversions needed for analysis.

In [23]:
# Print data types of all columns
# The dtypes attribute returns a Series with the data type of each column
print(df.dtypes)

Manufacturer         object
Category              int64
Screen               object
GPU                   int64
OS                    int64
CPU_core              int64
Screen_Size_inch     object
CPU_frequency       float64
RAM_GB                int64
Storage_GB_SSD        int64
Weight_kg            object
Price                 int64
dtype: object
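
`Screen_Size_inch` and `Weight_kg` are reported as `object` because they hold strings (plus NaN after the previous step). One possible conversion is sketched below on made-up values. Using `pd.to_numeric` with `errors="coerce"` is an assumption about how one might proceed, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical values mimicking the two object columns
toy = pd.DataFrame({
    "Screen_Size_inch": ["35.56", "39.624", np.nan],
    "Weight_kg": ["1.6.1", "2.2", "1.22"],
})

# errors="coerce" turns anything that cannot be parsed into NaN
for col in ["Screen_Size_inch", "Weight_kg"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
print(toy)
```

Note that coercion silently turns unparseable strings such as `1.6.1` into NaN, so it is worth counting missing values before and after the conversion.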


### Descriptive Statistics
#### Generate summary statistics for numerical and categorical features.

In [24]:
# Statistical description for numeric columns
# describe() provides summary statistics for numerical columns by default
# Includes count, mean, std, min, quartiles, and max
print(df.describe())

# Statistical description for object (categorical) columns only
# include='object' restricts describe() to object-dtype columns
# Includes count, unique (number of distinct values), top (most frequent value), and freq (count of the top value)
print(df.describe(include='object'))

       Category    GPU     OS  CPU_core  CPU_frequency  RAM_GB  \
count    238.00 238.00 238.00    238.00         238.00  238.00   
mean       3.21   2.15   1.06      5.63           2.36    7.88   
std        0.78   0.64   0.24      1.24           0.41    2.48   
min        1.00   1.00   1.00      3.00           1.20    4.00   
25%        3.00   2.00   1.00      5.00           2.00    8.00   
50%        3.00   2.00   1.00      5.00           2.50    8.00   
75%        4.00   3.00   1.00      7.00           2.70    8.00   
max        5.00   3.00   2.00      7.00           2.90   16.00   

       Storage_GB_SSD    Price  
count          238.00   238.00  
mean           245.78 1,462.34  
std             34.77   574.61  
min            128.00   527.00  
25%            256.00 1,066.50  
50%            256.00 1,333.00  
75%            256.00 1,777.00  
max            256.00 3,810.00  
       Manufacturer   Screen Screen_Size_inch Weight_kg
count           238      238              234       
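
For object columns, `describe` reports `count`, `unique`, `top`, and `freq` instead of the numeric statistics. A small sketch on made-up data, with `value_counts` as a complementary per-column view (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical data; dtype=object mirrors how the text columns are stored above
toy = pd.DataFrame({
    "Manufacturer": pd.Series(["Dell", "Dell", "HP", "Acer", "Dell"], dtype=object),
    "Screen": pd.Series(
        ["Full HD", "Full HD", "IPS Panel", "IPS Panel", "Full HD"], dtype=object
    ),
    "Price": [634, 946, 837, 978, 1016],
})

# Summary restricted to object columns: count, unique, top, freq
object_summary = toy.describe(include="object")
print(object_summary)

# Full frequency table for a single categorical column
manufacturer_counts = toy["Manufacturer"].value_counts()
print(manufacturer_counts)
```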

### Dataset Structure Overview
#### Print a concise overview of the DataFrame structure: column names, non-null counts, data types, and memory usage.


In [25]:
# Print summary information of the dataset
# info() prints a concise summary itself and returns None,
# so it is called directly rather than wrapped in print()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 238 entries, 0 to 237
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Manufacturer      238 non-null    object 
 1   Category          238 non-null    int64  
 2   Screen            238 non-null    object 
 3   GPU               238 non-null    int64  
 4   OS                238 non-null    int64  
 5   CPU_core          238 non-null    int64  
 6   Screen_Size_inch  234 non-null    object 
 7   CPU_frequency     238 non-null    float64
 8   RAM_GB            238 non-null    int64  
 9   Storage_GB_SSD    238 non-null    int64  
 10  Weight_kg         233 non-null    object 
 11  Price             238 non-null    int64  
dtypes: float64(1), int64(7), object(4)
memory usage: 22.4+ KB
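
`DataFrame.info()` writes its report to standard output and returns `None`. When the report is needed as text (for a log file, say), the `buf` parameter accepts a writable buffer. A minimal sketch on made-up data; the `toy` frame is hypothetical.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical data with one missing weight
toy = pd.DataFrame({
    "Manufacturer": ["Dell", "HP", "Acer"],
    "Weight_kg": [2.2, np.nan, 1.22],
})

# Redirect the report into an in-memory text buffer
buffer = io.StringIO()
result = toy.info(buf=buffer)
info_text = buffer.getvalue()

print(info_text)   # the report, now available as a string
print(result)      # info() itself returns None
```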
