In [1]:
pip install pandas


Note: you may need to restart the kernel to use updated packages.





# Pandas – Loading and Cleaning Data
Today, we'll learn to use **Pandas**, the most essential tool in a Python data scientist's toolkit, to take a raw,
messy dataset and turn it into a clean, reliable foundation for analysis.


1.  **The Building Blocks:** What are Pandas `Series` and `DataFrame` objects?
2.  **Getting the Data:** Reading CSV files and first-look inspection.
3.  **The Cleaning Workflow:**
    *   Handling Missing Values (`NaN`)
    *   Finding and Removing Duplicates
    *   Correcting Data Types and Formatting
4.  **Exporting Our Clean Data.**

## 1. Building Blocks: `Series` and `DataFrame`

Pandas has two core data structures:

*   **`Series`**: A one-dimensional labeled array, like a single column in a spreadsheet.
*   **`DataFrame`**: A two-dimensional labeled data structure with columns of potentially different types, like a full spreadsheet or an SQL table.

Let's quickly create them to see how they work. First, we need to import pandas. The standard convention is to import it as `pd`.

In [4]:
import pandas as pd
import numpy as np

a = [1,2,3,4,5]
pd.Series(a)

0    1
1    2
2    3
3    4
4    5
dtype: int64
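
The two structures are closely related: each column of a `DataFrame` is itself a `Series`. The following is a minimal sketch of that relationship; the table and the names `scores_df` and `score_column` are made up for illustration.

```python
import pandas as pd

# Hypothetical two-column table, used only for illustration.
scores_df = pd.DataFrame({"Name": ["Alice", "Bob"], "Score": [85, 92]})

# Selecting a single column returns a Series that shares the DataFrame's row index.
score_column = scores_df["Score"]

print(type(scores_df).__name__)     # DataFrame
print(type(score_column).__name__)  # Series
print(score_column.index.equals(scores_df.index))  # True
```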

### Creating a Series

A Series is like a cross between a list and a dictionary: it holds an ordered sequence of values, and each value has an index label.

#### Create a Series Object from a List
- A pandas **Series** is a one-dimensional labelled array.
- A Series combines the best features of a list and a dictionary.
- A Series maintains a single collection of ordered values (i.e. a single column of data).
- We can assign each value an identifier (its index label), which does not have to be unique.
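
To illustrate the last point, here is a small sketch of a Series whose index contains a repeated label. The data and the name `orders` are hypothetical.

```python
import pandas as pd

# Hypothetical order counts; the label "Vanilla" is deliberately repeated.
orders = pd.Series([3, 5, 2], index=["Vanilla", "Chocolate", "Vanilla"])

print(orders.index.is_unique)          # False

# Looking up a repeated label returns every matching value as a smaller Series.
print(orders.loc["Vanilla"].tolist())  # [3, 2]
```

Non-unique labels are allowed, but a lookup can then return several values, so it is worth checking `index.is_unique` before relying on one value per label.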

In [5]:
ice_cream = ["Chocolate", "Vanilla", "Strawberry", "Rum Raisin"]
pd.Series(ice_cream)

0     Chocolate
1       Vanilla
2    Strawberry
3    Rum Raisin
dtype: object

In [11]:
my_dict={
"name":"Manish",
    "age":21,
    "address":"Khotang",
    "abc":"def",
    "adas":"asda"
}
pd.Series(my_dict)

name        Manish
age             21
address    Khotang
abc            def
adas          asda
dtype: object

In [17]:
lists = ["ram", "shyam", "abdul", 24, True]
pd.Series(lists)  # pass the variable `lists`, not the built-in `list` type

0      ram
1    shyam
2    abdul
3       24
4     True
dtype: object

In [10]:
pd.Series(lists, index=a)

1      ram
2    shyam
3    abdul
4       24
5     True
dtype: object

In [12]:
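# The dictionary keys act as labels. None of the labels in `a` (1 to 5) match a key, so every value is NaN.
pd.Series(my_dict,index=a)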

1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
dtype: object
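
When a dictionary is combined with an explicit `index`, pandas looks each requested label up among the dictionary keys instead of assigning labels by position. A minimal sketch, assuming a shortened version of the dictionary above and a made-up label `"phone"`:

```python
import pandas as pd

person = {"name": "Manish", "age": 21, "address": "Khotang"}

# Labels that exist as dictionary keys keep their values; unknown labels become NaN.
selected = pd.Series(person, index=["name", "age", "phone"])
print(selected)

print(selected.isna().tolist())  # [False, False, True]
```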

In [13]:
registrations = [True, False, False, False, True]
pd.Series(registrations)

0     True
1    False
2    False
3    False
4     True
dtype: bool

In [20]:
pd.Series(data=lists,index=a)

1      ram
2    shyam
3    abdul
4       24
5     True
dtype: object

In [22]:
# Creating a Series from a list
student_grades = pd.Series([85, 92, 78, 65, 95], index=['Alice','Bob','Charlie', 'Henry', 'Smith'])
print(student_grades)

Alice      85
Bob        92
Charlie    78
Henry      65
Smith      95
dtype: int64


In [23]:
# A Series can have a custom index
student_names = pd.Series(
    [85, 92, 78],
    index=['Alice', 'Bob', 'Charlie']
)
print("A Series :")
print(student_names)

A Series :
Alice      85
Bob        92
Charlie    78
dtype: int64


In [25]:
sushi = {
    "Salmon": "Orange",
    "Tuna": "Red",
    "Eel": "Brown"
}

pd.Series(sushi)

Salmon    Orange
Tuna         Red
Eel        Brown
dtype: object

### Creating a `DataFrame`

A `DataFrame` is the most common object you'll work with. It's a collection of Series. The most common way to create one from scratch is using a dictionary.


In [26]:
#Create a dictionary where keys are column names and values are lists(the column data)

data ={
    'StudentID': [101, 102, 103, 104],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Score': [85, 92, 78, 88]
}
student_df =pd.DataFrame(data)
print("A Dataframe")
student_df

A Dataframe


Unnamed: 0,StudentID,Name,Score
0,101,Alice,85
1,102,Bob,92
2,103,Charlie,78
3,104,David,88


In [29]:
data ={
    'StudentID': [101, 102, 103, 104],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Score': [85, 92, 78, 88]
}
student_df = pd.DataFrame(data, index=data['StudentID'])  # use the IDs as row labels
print("A Dataframe")
student_df

A Dataframe


Unnamed: 0,StudentID,Name,Score
101,101,Alice,85
102,102,Bob,92
103,103,Charlie,78
104,104,David,88


In [43]:
data ={
    'StudentID': [101, 102, 103, 104],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Score': [85, 92, 78, 88]
}
student_df =pd.DataFrame(data)

student_df = student_df.set_index("StudentID")
print("A Dataframe")
student_df

A Dataframe


StudentID,Name,Score
101,Alice,85
102,Bob,92
103,Charlie,78
104,David,88


## 2. Reading Data & Initial Inspection

Manually creating DataFrames is rare. Most of the time, you'll load data from a file, most commonly a CSV (Comma-Separated Values) file.

We'll use the powerful `pd.read_csv()` function.
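
`pd.read_csv()` accepts a file path or any file-like object, which makes it easy to try out on a string. The sketch below uses a made-up three-line CSV and the `index_col` parameter to choose the row labels while loading.

```python
import io
import pandas as pd

# Hypothetical three-line CSV held in memory instead of on disk.
csv_text = "OrderID,Product,Quantity\n1001,Laptop,2\n1002,Mouse,\n"

# index_col picks the column that becomes the row labels.
orders = pd.read_csv(io.StringIO(csv_text), index_col="OrderID")

print(orders)
print(orders.dtypes)  # Quantity is float64 because the empty field became NaN
```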

In [39]:
#Let's create a string containing our messy CSV data

messy_data_csv = """OrderID,OrderDate,Product,Price,Quantity,Region
1001,2023-01-05,Laptop,$100,2,North
1002,2023-01-07,Mouse,$25.50,5,South
1003,2023-01-10,Keyboard,,3,North
1004,2023-01-12,Monitor,$300,,"West"
1005,2023-01-15,Webcam,$45.99,1,East
1002,2023-01-07,Mouse,$25.50,5,South
1006,2023-01-18,,$15.00,2,East
1007,2023-01-20,Laptop,$1200.00,1, North
1008,2023-01-22,External HDD,$80,4,USA
"""

with open ('slaes_data_messy.csv','w') as f:
    f.write(messy_data_csv)
with open ('slaes_data_messy.csv','r') as f:
    print(f.read())


OrderID,OrderDate,Product,Price,Quantity,Region
1001,2023-01-05,Laptop,$100,2,North
1002,2023-01-07,Mouse,$25.50,5,South
1003,2023-01-10,Keyboard,,3,North
1004,2023-01-12,Monitor,$300,,"West"
1005,2023-01-15,Webcam,$45.99,1,East
1002,2023-01-07,Mouse,$25.50,5,South
1006,2023-01-18,,$15.00,2,East
1007,2023-01-20,Laptop,$1200.00,1, North
1008,2023-01-22,External HDD,$80,4,USA



In [41]:
df = pd.read_csv('slaes_data_messy.csv')
df

Unnamed: 0,OrderID,OrderDate,Product,Price,Quantity,Region
0,1001,2023-01-05,Laptop,$100,2.0,North
1,1002,2023-01-07,Mouse,$25.50,5.0,South
2,1003,2023-01-10,Keyboard,,3.0,North
3,1004,2023-01-12,Monitor,$300,,West
4,1005,2023-01-15,Webcam,$45.99,1.0,East
5,1002,2023-01-07,Mouse,$25.50,5.0,South
6,1006,2023-01-18,,$15.00,2.0,East
7,1007,2023-01-20,Laptop,$1200.00,1.0,North
8,1008,2023-01-22,External HDD,$80,4.0,USA


In [45]:
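# Experiment: the header row is moved to the bottom, so pandas uses the first data row as column names.
messy_data_csv = """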
1001,2023-01-05,Laptop,$100,2,North
1002,2023-01-07,Mouse,$25.50,5,South
1003,2023-01-10,Keyboard,,3,North
1004,2023-01-12,Monitor,$300,,"West"
1005,2023-01-15,Webcam,$45.99,1,East
1002,2023-01-07,Mouse,$25.50,5,South
1006,2023-01-18,,$15.00,2,East
1007,2023-01-20,Laptop,$1200.00,1, North
1008,2023-01-22,External HDD,$80,4,USA
OrderID,OrderDate,Product,Price,Quantity,Region
"""

with open ('slaes_data_messy.csv','w') as f:
    f.write(messy_data_csv)

df = pd.read_csv('slaes_data_messy.csv')
df

Unnamed: 0,1001,2023-01-05,Laptop,$100,2,North
0,1002,2023-01-07,Mouse,$25.50,5,South
1,1003,2023-01-10,Keyboard,,3,North
2,1004,2023-01-12,Monitor,$300,,West
3,1005,2023-01-15,Webcam,$45.99,1,East
4,1002,2023-01-07,Mouse,$25.50,5,South
5,1006,2023-01-18,,$15.00,2,East
6,1007,2023-01-20,Laptop,$1200.00,1,North
7,1008,2023-01-22,External HDD,$80,4,USA
8,OrderID,OrderDate,Product,Price,Quantity,Region
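
The output above shows that `read_csv` treats the first non-blank line as the header by default. For a file that has no header row at all, one option is to pass `header=None` together with `names`. This is a sketch with hypothetical data and column names:

```python
import io
import pandas as pd

# Hypothetical headerless extract; the column names below are supplied by hand.
raw_rows = "1001,Laptop,2\n1002,Mouse,5\n"
column_names = ["OrderID", "Product", "Quantity"]

# header=None stops pandas from consuming the first data row as column labels.
orders = pd.read_csv(io.StringIO(raw_rows), header=None, names=column_names)

print(orders.columns.tolist())  # ['OrderID', 'Product', 'Quantity']
print(len(orders))              # 2
```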


In [5]:
messy_data_csv = """OrderID,OrderDate,Product,Price,Quantity,Region
1001,2023-01-05,Laptop,$100,2,North
1002,2023-01-07,Mouse,$25.50,5,South
1003,2023-01-10,Keyboard,,3,North
1004,2023-01-12,Monitor,$300,,"West"
1005,2023-01-15,Webcam,$45.99,1,East
1002,2023-01-07,Mouse,$25.50,5,South
1006,2023-01-18,,$15.00,2,East
1007,2023-01-20,Laptop,$1200.00,1, North
1008,2023-01-22,External HDD,$80,4,USA

"""

with open ('slaes_data_messy.csv','w') as f:
    f.write(messy_data_csv)

df = pd.read_csv('slaes_data_messy.csv')
df =df.set_index("OrderID")
df

OrderID,OrderDate,Product,Price,Quantity,Region
1001,2023-01-05,Laptop,$100,2.0,North
1002,2023-01-07,Mouse,$25.50,5.0,South
1003,2023-01-10,Keyboard,,3.0,North
1004,2023-01-12,Monitor,$300,,West
1005,2023-01-15,Webcam,$45.99,1.0,East
1002,2023-01-07,Mouse,$25.50,5.0,South
1006,2023-01-18,,$15.00,2.0,East
1007,2023-01-20,Laptop,$1200.00,1.0,North
1008,2023-01-22,External HDD,$80,4.0,USA


In [51]:
df.isnull()

OrderID,OrderDate,Product,Price,Quantity,Region
1001,False,False,False,False,False
1002,False,False,False,False,False
1003,False,False,True,False,False
1004,False,False,False,True,False
1005,False,False,False,False,False
1002,False,False,False,False,False
1006,False,True,False,False,False
1007,False,False,False,False,False
1008,False,False,False,False,False


In [52]:
df.head()  # shows the first 5 rows by default; pass a number to change that, e.g. df.head(2)

OrderID,OrderDate,Product,Price,Quantity,Region
1001,2023-01-05,Laptop,$100,2.0,North
1002,2023-01-07,Mouse,$25.50,5.0,South
1003,2023-01-10,Keyboard,,3.0,North
1004,2023-01-12,Monitor,$300,,West
1005,2023-01-15,Webcam,$45.99,1.0,East


In [58]:
df.head(2)

OrderID,OrderDate,Product,Price,Quantity,Region
1001,2023-01-05,Laptop,$100,2.0,North
1002,2023-01-07,Mouse,$25.50,5.0,South


In [53]:
df.tail()

OrderID,OrderDate,Product,Price,Quantity,Region
1005,2023-01-15,Webcam,$45.99,1.0,East
1002,2023-01-07,Mouse,$25.50,5.0,South
1006,2023-01-18,,$15.00,2.0,East
1007,2023-01-20,Laptop,$1200.00,1.0,North
1008,2023-01-22,External HDD,$80,4.0,USA


In [59]:
df.tail(3)

OrderID,OrderDate,Product,Price,Quantity,Region
1006,2023-01-18,,$15.00,2.0,East
1007,2023-01-20,Laptop,$1200.00,1.0,North
1008,2023-01-22,External HDD,$80,4.0,USA


In [54]:
df.shape

(9, 5)

In [57]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 1001 to 1008
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   OrderDate  9 non-null      object 
 1   Product    8 non-null      object 
 2   Price      8 non-null      object 
 3   Quantity   8 non-null      float64
 4   Region     9 non-null      object 
dtypes: float64(1), object(4)
memory usage: 432.0+ bytes


In [56]:
df.describe()

Unnamed: 0,Quantity
count,8.0
mean,2.875
std,1.642081
min,1.0
25%,1.75
50%,2.5
75%,4.25
max,5.0


### First-Look Inspection: The "Medical Checkup" for Data

Never start cleaning without a proper checkup. These are your essential first commands.

*   `.head()`: View the first 5 rows.
*   `.tail()`: View the last 5 rows.
*   `.shape`: Get the number of rows and columns (rows, columns).
*   `.info()`: **CRUCIAL!** Get a summary of the DataFrame, including data types and non-null counts.
*   `.describe()`: Get a statistical summary of the numerical columns.
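
As a compact recap, the checkup commands can be run together on a tiny table. The DataFrame `sample` below is hypothetical and exists only to show what each call reports.

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataset with one missing quantity.
sample = pd.DataFrame({
    "Product": ["Laptop", "Mouse", "Monitor"],
    "Quantity": [2.0, 5.0, np.nan],
})

print(sample.head(2))         # first two rows
print(sample.shape)           # (3, 2)
sample.info()                 # dtypes and non-null counts per column
print(sample.describe())      # summary statistics for the numeric column only
print(sample.isnull().sum())  # missing values per column
```

Comparing the non-null counts from `.info()` with the row count from `.shape` is usually the quickest way to spot columns with missing data.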

In [60]:
new_df= df[["Product","Quantity","Price"]]
new_df

OrderID,Product,Quantity,Price
1001,Laptop,2.0,$100
1002,Mouse,5.0,$25.50
1003,Keyboard,3.0,
1004,Monitor,,$300
1005,Webcam,1.0,$45.99
1002,Mouse,5.0,$25.50
1006,,2.0,$15.00
1007,Laptop,1.0,$1200.00
1008,External HDD,4.0,$80


In [61]:
new_df.columns

Index(['Product', 'Quantity', 'Price'], dtype='object')

In [64]:
df.columns  # important: lists the exact column names

Index(['OrderDate', 'Product', 'Price', 'Quantity', 'Region'], dtype='object')

In [3]:
import pandas as pd
import json
data = [
    {
        "id": 1,
        "name": "Sundar Joshi",
        "age": 22,
        "department": "Social Science",
        "salary": 6000
    },
    {
        "id": 2,
        "name": "Sita Sharma",
        "age": 25,
        "department": "Engineer",
        "salary": 5000
    },
    {
        "id": 3,
        "name": "Ram Karki",
        "age": 30,
        "department": "Front Desk",
        "salary": 5000
    }]
with open ("employee.json","w") as file:
    json.dump(data,file,indent=4)

print("JSON file created successfully!")


JSON file created successfully!


In [4]:
# Use a separate name so the sales DataFrame `df` is not overwritten
employee_df = pd.read_json("employee.json")
employee_df

Unnamed: 0,id,name,age,department,salary
0,1,Sundar Joshi,22,Social Science,6000
1,2,Sita Sharma,25,Engineer,5000
2,3,Ram Karki,30,Front Desk,5000


In [6]:
df.isnull()

OrderID,OrderDate,Product,Price,Quantity,Region
1001,False,False,False,False,False
1002,False,False,False,False,False
1003,False,False,True,False,False
1004,False,False,False,True,False
1005,False,False,False,False,False
1002,False,False,False,False,False
1006,False,True,False,False,False
1007,False,False,False,False,False
1008,False,False,False,False,False


In [10]:
df.isnull().sum()


OrderDate    0
Product      1
Price        1
Quantity     1
Region       0
dtype: int64


## 3. The Data Cleaning Workflow

### Step 1: Handling Missing Values

Missing data is often represented as `NaN` (Not a Number). Our first step is to identify where they are and decide on a strategy.

**Strategy Options:**
1.  **Drop:** Remove rows or columns with missing values. (Use if the data is unusable or if you have a huge dataset and can afford to lose some rows).
2.  **Fill (Impute):** Replace missing values with something meaningful (e.g., 0, the mean, the median, or the most frequent value).

First, let's count the `NaN`s in each column.
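
The two strategies can be compared side by side on a small example. This is a sketch on a hypothetical table named `sales`, not the notebook's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing product name and one missing quantity.
sales = pd.DataFrame({
    "Product": ["Laptop", None, "Mouse", "Monitor"],
    "Quantity": [2.0, 1.0, np.nan, 4.0],
})

print(sales.isnull().sum())  # NaN count per column

# Strategy 1: drop every row that has at least one missing value.
dropped = sales.dropna()
print(dropped.shape)         # (2, 2)

# Strategy 2: fill each column with a value that suits it.
filled = sales.fillna({"Product": "Unknown", "Quantity": sales["Quantity"].median()})
print(filled)
```

Dropping halves this table, while filling keeps all four rows at the cost of introducing estimated values.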

#### Cleaning `Product` and `Quantity`

*   The missing `Product` name makes that row less useful for sales analysis.
*   A missing `Quantity` could be assumed to be 0, but a missing price is harder to guess.

Let's start by filling the missing `Quantity` with the **median** value of the column. Using the median is often better than the mean because it's less sensitive to outliers.

And for the missing `Product`, we will fill it with the string 'Unknown'.

In [14]:
# Fill the missing quantity with the median of the column
median_quantity = df['Quantity'].median()

print(f"The median quantity is: {median_quantity}")

df.fillna({"Quantity": median_quantity},inplace=True)

df.fillna({'Product': 'Unknown'}, inplace=True)

# Equivalent without inplace: df["Product"] = df["Product"].fillna("Unknown")

The median quantity is: 2.5


In [15]:
df

OrderID,OrderDate,Product,Price,Quantity,Region
1001,2023-01-05,Laptop,$100,2.0,North
1002,2023-01-07,Mouse,$25.50,5.0,South
1003,2023-01-10,Keyboard,,3.0,North
1004,2023-01-12,Monitor,$300,2.5,West
1005,2023-01-15,Webcam,$45.99,1.0,East
1002,2023-01-07,Mouse,$25.50,5.0,South
1006,2023-01-18,Unknown,$15.00,2.0,East
1007,2023-01-20,Laptop,$1200.00,1.0,North
1008,2023-01-22,External HDD,$80,4.0,USA


In [17]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 1001 to 1008
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   OrderDate  9 non-null      object 
 1   Product    9 non-null      object 
 2   Price      8 non-null      object 
 3   Quantity   9 non-null      float64
 4   Region     9 non-null      object 
dtypes: float64(1), object(4)
memory usage: 432.0+ bytes


In [23]:
df["Price"]=df["Price"].str.replace("$","",regex=False) #removing the dollar sign

In [24]:
df

OrderID,OrderDate,Product,Price,Quantity,Region
1001,2023-01-05,Laptop,100.0,2.0,North
1002,2023-01-07,Mouse,25.5,5.0,South
1003,2023-01-10,Keyboard,,3.0,North
1004,2023-01-12,Monitor,300.0,2.5,West
1005,2023-01-15,Webcam,45.99,1.0,East
1002,2023-01-07,Mouse,25.5,5.0,South
1006,2023-01-18,Unknown,15.0,2.0,East
1007,2023-01-20,Laptop,1200.0,1.0,North
1008,2023-01-22,External HDD,80.0,4.0,USA


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 1001 to 1008
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   OrderDate  9 non-null      object 
 1   Product    9 non-null      object 
 2   Price      8 non-null      object 
 3   Quantity   9 non-null      float64
 4   Region     9 non-null      object 
dtypes: float64(1), object(4)
memory usage: 432.0+ bytes


In [28]:
df["Price"]=df["Price"].astype(float) #converting object data type into float data type

In [30]:
df

OrderID,OrderDate,Product,Price,Quantity,Region
1001,2023-01-05,Laptop,100.0,2.0,North
1002,2023-01-07,Mouse,25.5,5.0,South
1003,2023-01-10,Keyboard,,3.0,North
1004,2023-01-12,Monitor,300.0,2.5,West
1005,2023-01-15,Webcam,45.99,1.0,East
1002,2023-01-07,Mouse,25.5,5.0,South
1006,2023-01-18,Unknown,15.0,2.0,East
1007,2023-01-20,Laptop,1200.0,1.0,North
1008,2023-01-22,External HDD,80.0,4.0,USA


In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 1001 to 1008
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   OrderDate  9 non-null      object 
 1   Product    9 non-null      object 
 2   Price      8 non-null      float64
 3   Quantity   9 non-null      float64
 4   Region     9 non-null      object 
dtypes: float64(2), object(3)
memory usage: 432.0+ bytes


In [33]:
mean_price = df['Price'].mean()

print(f"The mean of price is: {mean_price}")

df.fillna({"Price": mean_price},inplace=True)

The mean of price is: 223.99875


In [34]:
df

OrderID,OrderDate,Product,Price,Quantity,Region
1001,2023-01-05,Laptop,100.0,2.0,North
1002,2023-01-07,Mouse,25.5,5.0,South
1003,2023-01-10,Keyboard,223.99875,3.0,North
1004,2023-01-12,Monitor,300.0,2.5,West
1005,2023-01-15,Webcam,45.99,1.0,East
1002,2023-01-07,Mouse,25.5,5.0,South
1006,2023-01-18,Unknown,15.0,2.0,East
1007,2023-01-20,Laptop,1200.0,1.0,North
1008,2023-01-22,External HDD,80.0,4.0,USA


In [36]:
df.isnull().sum()

OrderDate    0
Product      0
Price        0
Quantity     0
Region       0
dtype: int64

In [39]:
df.duplicated()

OrderID
1001    False
1002    False
1003    False
1004    False
1005    False
1002     True
1006    False
1007    False
1008    False
dtype: bool

In [40]:
dir(df)

['OrderDate',
 'Price',
 'Product',
 'Quantity',
 'Region',
 'T',
 ...]

In [44]:
df = df.drop_duplicates()

In [46]:
df

OrderID,OrderDate,Product,Price,Quantity,Region
1001,2023-01-05,Laptop,100.0,2.0,North
1002,2023-01-07,Mouse,25.5,5.0,South
1003,2023-01-10,Keyboard,223.99875,3.0,North
1004,2023-01-12,Monitor,300.0,2.5,West
1005,2023-01-15,Webcam,45.99,1.0,East
1006,2023-01-18,Unknown,15.0,2.0,East
1007,2023-01-20,Laptop,1200.0,1.0,North
1008,2023-01-22,External HDD,80.0,4.0,USA


In [47]:
df.duplicated().sum()

0

In [48]:
# Drop rows where any column has a missing value.
# Reassigning (instead of inplace=True) avoids a SettingWithCopyWarning on a derived DataFrame.
df = df.dropna()

# Let's check our work
print("Dataframe after dropping rows with any remaining NaNs:")
print(df.isnull().sum())
print(f"\nNew shape of the DataFrame: {df.shape}")
df

Dataframe after dropping rows with any remaining NaNs:
OrderDate    0
Product      0
Price        0
Quantity     0
Region       0
dtype: int64

New shape of the DataFrame: (8, 5)


OrderID,OrderDate,Product,Price,Quantity,Region
1001,2023-01-05,Laptop,100.0,2.0,North
1002,2023-01-07,Mouse,25.5,5.0,South
1003,2023-01-10,Keyboard,223.99875,3.0,North
1004,2023-01-12,Monitor,300.0,2.5,West
1005,2023-01-15,Webcam,45.99,1.0,East
1006,2023-01-18,Unknown,15.0,2.0,East
1007,2023-01-20,Laptop,1200.0,1.0,North
1008,2023-01-22,External HDD,80.0,4.0,USA


### Step 2: Handling Duplicates
Duplicate data can skew our analysis, leading to incorrect sums and counts.
* `.duplicated().sum()`: Count duplicate rows.
* `.drop_duplicates()`: Remove them.
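
Both methods also accept `subset` and `keep`, which matter when rows repeat an identifier without being identical in every column. A minimal sketch on a hypothetical `orders` table:

```python
import pandas as pd

# Hypothetical orders; row 2 repeats row 0 exactly, row 3 repeats only the OrderID.
orders = pd.DataFrame({
    "OrderID": [1001, 1002, 1001, 1002],
    "Product": ["Laptop", "Mouse", "Laptop", "Keyboard"],
})

print(orders.duplicated().sum())                  # 1 fully identical row
print(orders.duplicated(subset="OrderID").sum())  # 2 repeated IDs

# keep="first" (the default) retains the earliest copy; keep="last" retains the latest.
deduped = orders.drop_duplicates(subset="OrderID", keep="last")
print(deduped)
```

Note that `duplicated()` compares column values only, so an identifier stored in the index is not part of the comparison.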

In [50]:
#check for duplicates
print(f"Number of duplicate rows: {df.duplicated().sum()}")

df= df.drop_duplicates()
print(f"\nNow: {df.duplicated().sum()}")
df

Number of duplicate rows: 0

Now: 0


OrderID,OrderDate,Product,Price,Quantity,Region
1001,2023-01-05,Laptop,100.0,2.0,North
1002,2023-01-07,Mouse,25.5,5.0,South
1003,2023-01-10,Keyboard,223.99875,3.0,North
1004,2023-01-12,Monitor,300.0,2.5,West
1005,2023-01-15,Webcam,45.99,1.0,East
1006,2023-01-18,Unknown,15.0,2.0,East
1007,2023-01-20,Laptop,1200.0,1.0,North
1008,2023-01-22,External HDD,80.0,4.0,USA


In [52]:
df.shape

(8, 5)

In [None]:
# keep="last" keeps the last occurrence of each duplicated row (the default is keep="first")
df = df.drop_duplicates(keep="last")


### Step 3: Correcting Data Types and Formatting

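This is where the real magic happens. A `Price` column that contains '$' signs or thousands separators is read as `object` (text), and we can't calculate total sales with text. In this notebook `Price` was already converted to `float64` while filling its missing value in Step 1, as the `.info()` output below confirms, but the general recipe is worth spelling out.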

We'll use string methods (`.str`) to clean it up and then `.astype()` to convert it.
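
As an alternative to `.astype(float)`, `pd.to_numeric` with `errors="coerce"` tolerates entries that still cannot be parsed after cleaning. The sketch below uses a hypothetical Series `raw_price`; the `"n/a"` entry is an assumption added to show the coercion.

```python
import pandas as pd

# Hypothetical raw prices as text; "n/a" stands in for an unparseable entry.
raw_price = pd.Series(["$1,200.00", "$25.50", "n/a"])

# Strip the currency symbol and thousands separator as literal characters.
cleaned = raw_price.str.replace("$", "", regex=False).str.replace(",", "", regex=False)

# errors="coerce" turns anything that still isn't a number into NaN instead of raising.
price = pd.to_numeric(cleaned, errors="coerce")

print(price.tolist())  # [1200.0, 25.5, nan]
print(price.dtype)     # float64
```

`.astype(float)` would raise a `ValueError` on the `"n/a"` entry, which can be preferable when unexpected values should stop the pipeline.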

In [54]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8 entries, 1001 to 1008
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   OrderDate  8 non-null      object 
 1   Product    8 non-null      object 
 2   Price      8 non-null      float64
 3   Quantity   8 non-null      float64
 4   Region     8 non-null      object 
dtypes: float64(2), object(3)
memory usage: 384.0+ bytes


In [57]:
print(f"Data type of 'Price':{df['Price'].dtype}")

Data type of 'Price':float64


In [None]:
# 1. Remove the '$' sign and any commas using string replace.
#    astype(str) keeps this cell safe to re-run after 'Price' is already numeric.
price_text = df['Price'].astype(str)
price_text = price_text.str.replace('$', '', regex=False)
price_text = price_text.str.replace(',', '', regex=False)

# 2. Convert the cleaned column to a numeric type (float)
df['Price'] = price_text.astype(float)


print(f"Data type of 'Price' after cleaning: {df['Price'].dtype}")

In [58]:
# Let's also fix the leading space in the 'Region' column
df['Region'] = df['Region'].str.strip()

In [59]:
df

OrderID,OrderDate,Product,Price,Quantity,Region
1001,2023-01-05,Laptop,100.0,2.0,North
1002,2023-01-07,Mouse,25.5,5.0,South
1003,2023-01-10,Keyboard,223.99875,3.0,North
1004,2023-01-12,Monitor,300.0,2.5,West
1005,2023-01-15,Webcam,45.99,1.0,East
1006,2023-01-18,Unknown,15.0,2.0,East
1007,2023-01-20,Laptop,1200.0,1.0,North
1008,2023-01-22,External HDD,80.0,4.0,USA


In [60]:
df['Total Sales'] = df['Price']*df['Quantity']
df

OrderID,OrderDate,Product,Price,Quantity,Region,Total Sales
1001,2023-01-05,Laptop,100.0,2.0,North,200.0
1002,2023-01-07,Mouse,25.5,5.0,South,127.5
1003,2023-01-10,Keyboard,223.99875,3.0,North,671.99625
1004,2023-01-12,Monitor,300.0,2.5,West,750.0
1005,2023-01-15,Webcam,45.99,1.0,East,45.99
1006,2023-01-18,Unknown,15.0,2.0,East,30.0
1007,2023-01-20,Laptop,1200.0,1.0,North,1200.0
1008,2023-01-22,External HDD,80.0,4.0,USA,320.0


In [61]:
df['Total_Sales'] = df['Price']*df['Quantity']
df

OrderID,OrderDate,Product,Price,Quantity,Region,Total Sales,Total_Sales
1001,2023-01-05,Laptop,100.0,2.0,North,200.0,200.0
1002,2023-01-07,Mouse,25.5,5.0,South,127.5,127.5
1003,2023-01-10,Keyboard,223.99875,3.0,North,671.99625,671.99625
1004,2023-01-12,Monitor,300.0,2.5,West,750.0,750.0
1005,2023-01-15,Webcam,45.99,1.0,East,45.99,45.99
1006,2023-01-18,Unknown,15.0,2.0,East,30.0,30.0
1007,2023-01-20,Laptop,1200.0,1.0,North,1200.0,1200.0
1008,2023-01-22,External HDD,80.0,4.0,USA,320.0,320.0


In [62]:
df.duplicated()

OrderID
1001    False
1002    False
1003    False
1004    False
1005    False
1006    False
1007    False
1008    False
dtype: bool

In [64]:
df

OrderID,OrderDate,Product,Price,Quantity,Region,Total Sales,Total_Sales
1001,2023-01-05,Laptop,100.0,2.0,North,200.0,200.0
1002,2023-01-07,Mouse,25.5,5.0,South,127.5,127.5
1003,2023-01-10,Keyboard,223.99875,3.0,North,671.99625,671.99625
1004,2023-01-12,Monitor,300.0,2.5,West,750.0,750.0
1005,2023-01-15,Webcam,45.99,1.0,East,45.99,45.99
1006,2023-01-18,Unknown,15.0,2.0,East,30.0,30.0
1007,2023-01-20,Laptop,1200.0,1.0,North,1200.0,1200.0
1008,2023-01-22,External HDD,80.0,4.0,USA,320.0,320.0


In [66]:
df = df.drop(columns="Total_Sales")  # drop() returns a new DataFrame, so assign the result
df

OrderID,OrderDate,Product,Price,Quantity,Region,Total Sales
1001,2023-01-05,Laptop,100.0,2.0,North,200.0
1002,2023-01-07,Mouse,25.5,5.0,South,127.5
1003,2023-01-10,Keyboard,223.99875,3.0,North,671.99625
1004,2023-01-12,Monitor,300.0,2.5,West,750.0
1005,2023-01-15,Webcam,45.99,1.0,East,45.99
1006,2023-01-18,Unknown,15.0,2.0,East,30.0
1007,2023-01-20,Laptop,1200.0,1.0,North,1200.0
1008,2023-01-22,External HDD,80.0,4.0,USA,320.0


In [68]:
df['Total Sales'].sum()

3345.48625

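## 4. Exporting Our Clean Data

In [69]:
# OrderID is the index, so keep it in the file (index=False would silently drop it)
df.to_csv('clean_data.csv')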
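
Because `OrderID` is stored as the DataFrame's index here, whether it reaches the file depends on the `index` argument of `to_csv`. The following sketch checks a round trip through an in-memory buffer; the table `clean` is hypothetical.

```python
import io
import pandas as pd

# Hypothetical cleaned table whose row labels are the order IDs.
clean = pd.DataFrame(
    {"Product": ["Laptop", "Mouse"], "Price": [100.0, 25.5]},
    index=pd.Index([1001, 1002], name="OrderID"),
)

# Write to an in-memory buffer; a file path works the same way.
buffer = io.StringIO()
clean.to_csv(buffer)  # index=True is the default, so OrderID is written
buffer.seek(0)

# Read it back and restore OrderID as the index.
restored = pd.read_csv(buffer, index_col="OrderID")
print(restored)
```

Reading the exported file back once is a cheap way to confirm that no column or identifier was lost on the way out.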

## 5. Your Assignment

For your assignment, you will clean a new dataset: `student_performance_messy.csv`. It contains information about student scores and has its own set of problems.

**Your Task:**
1.  **Create and Load the Data:** Run the provided code to create `student_performance_messy.csv` and load it into a DataFrame.
2.  **Inspect the Data:** Use `.info()`, `.head()`, and `.isnull().sum()` to understand the issues.
3.  **Clean the Data:**
    *   The `gender` column has inconsistent values (`M`, `F`, `Male`, `Female`). Standardize them to `M` and `F`.
    *   The `score` column has some values as percentages (e.g., "75%") and some missing. Convert all scores to a numeric format (e.g., 75.0).
    *   There is a `study_hours` column with some negative values, which is impossible. Replace any negative hours with 0.
    *   Check for and remove any duplicate rows.
4.  **Export the Result:** Save your final, cleaned DataFrame to a file named `student_performance_cleaned.csv`.

Here is the code to generate your assignment's messy dataset:

In [77]:
import pandas as pd

students_data_messy = """student_id,name,gender,score,study_hours
1,Alice,F,85,10
2,Bob,M,92,8
3,Charlie,Male,"78%",12
4,Diana,F,,15
5,Ethan,M,65,5
6,Fiona,Female,95,11
7,George,M,-5,4
5,Ethan,M,65,5
8,Hannah,F,88,9
9,Ian,Male,"62%",-2
10,Jane,F,,14
"""

with open('student_performance_messy.csv', 'w') as f:
    f.write(students_data_messy)

df = pd.read_csv("student_performance_messy.csv")
df

Unnamed: 0,student_id,name,gender,score,study_hours
0,1,Alice,F,85,10
1,2,Bob,M,92,8
2,3,Charlie,Male,78%,12
3,4,Diana,F,,15
4,5,Ethan,M,65,5
5,6,Fiona,Female,95,11
6,7,George,M,-5,4
7,5,Ethan,M,65,5
8,8,Hannah,F,88,9
9,9,Ian,Male,62%,-2


In [78]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   student_id   11 non-null     int64 
 1   name         11 non-null     object
 2   gender       11 non-null     object
 3   score        9 non-null      object
 4   study_hours  11 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 568.0+ bytes


In [79]:
df.isnull()

Unnamed: 0,student_id,name,gender,score,study_hours
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,True,False
4,False,False,False,False,False
5,False,False,False,False,False
6,False,False,False,False,False
7,False,False,False,False,False
8,False,False,False,False,False
9,False,False,False,False,False


In [80]:
df.head()

Unnamed: 0,student_id,name,gender,score,study_hours
0,1,Alice,F,85,10
1,2,Bob,M,92,8
2,3,Charlie,Male,78%,12
3,4,Diana,F,,15
4,5,Ethan,M,65,5


In [7]:
df.sum()


student_id                                                    60
name           AliceBobCharlieDianaEthanFionaGeorgeEthanHanna...
gender                                    FMMaleFMFemaleMMFMaleF
study_hours                                                   91
dtype: object

In [81]:
df= df.set_index('student_id')
df

student_id,name,gender,score,study_hours
1,Alice,F,85,10
2,Bob,M,92,8
3,Charlie,Male,78%,12
4,Diana,F,,15
5,Ethan,M,65,5
6,Fiona,Female,95,11
7,George,M,-5,4
5,Ethan,M,65,5
8,Hannah,F,88,9
9,Ian,Male,62%,-2


In [82]:
df.shape

(11, 4)

In [83]:
df= df.reset_index()
df

Unnamed: 0,student_id,name,gender,score,study_hours
0,1,Alice,F,85,10
1,2,Bob,M,92,8
2,3,Charlie,Male,78%,12
3,4,Diana,F,,15
4,5,Ethan,M,65,5
5,6,Fiona,Female,95,11
6,7,George,M,-5,4
7,5,Ethan,M,65,5
8,8,Hannah,F,88,9
9,9,Ian,Male,62%,-2


In [84]:
df['score'] = df['score'].str.replace('%', '', regex=False)
df

Unnamed: 0,student_id,name,gender,score,study_hours
0,1,Alice,F,85.0,10
1,2,Bob,M,92.0,8
2,3,Charlie,Male,78.0,12
3,4,Diana,F,,15
4,5,Ethan,M,65.0,5
5,6,Fiona,Female,95.0,11
6,7,George,M,-5.0,4
7,5,Ethan,M,65.0,5
8,8,Hannah,F,88.0,9
9,9,Ian,Male,62.0,-2


In [85]:
df['gender'].nunique()

4

In [86]:
# Map whole values exactly; unlike .str.replace this cannot match inside another word
df['gender'] = df['gender'].replace({"Male": "M", "Female": "F"})
df

Unnamed: 0,student_id,name,gender,score,study_hours
0,1,Alice,F,85.0,10
1,2,Bob,M,92.0,8
2,3,Charlie,M,78.0,12
3,4,Diana,F,,15
4,5,Ethan,M,65.0,5
5,6,Fiona,F,95.0,11
6,7,George,M,-5.0,4
7,5,Ethan,M,65.0,5
8,8,Hannah,F,88.0,9
9,9,Ian,M,62.0,-2
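
When several labels map to the same category, one dictionary can replace the chain of `str.replace` calls. A minimal sketch on a made-up Series `gender` (hypothetical data), using `Series.replace`, which matches whole values rather than substrings:

```python
import pandas as pd

# Hypothetical column with the same four spellings seen above.
gender = pd.Series(["F", "M", "Male", "Female", "M"])

# Each key is replaced by its value; entries not listed stay unchanged.
gender_clean = gender.replace({"Male": "M", "Female": "F"})
print(gender_clean.unique())
```

Because whole values are matched, a label such as "Female" can never be half-rewritten by the rule for "Male".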


In [87]:
df["score"]=df["score"].astype(float) 
df

Unnamed: 0,student_id,name,gender,score,study_hours
0,1,Alice,F,85.0,10
1,2,Bob,M,92.0,8
2,3,Charlie,M,78.0,12
3,4,Diana,F,,15
4,5,Ethan,M,65.0,5
5,6,Fiona,F,95.0,11
6,7,George,M,-5.0,4
7,5,Ethan,M,65.0,5
8,8,Hannah,F,88.0,9
9,9,Ian,M,62.0,-2


In [88]:
df["score"]=df["score"].abs()
df

Unnamed: 0,student_id,name,gender,score,study_hours
0,1,Alice,F,85.0,10
1,2,Bob,M,92.0,8
2,3,Charlie,M,78.0,12
3,4,Diana,F,,15
4,5,Ethan,M,65.0,5
5,6,Fiona,F,95.0,11
6,7,George,M,5.0,4
7,5,Ethan,M,65.0,5
8,8,Hannah,F,88.0,9
9,9,Ian,M,62.0,-2
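
Using `abs()` assumes the minus sign was a typo and the magnitude is correct. If that assumption is doubtful, another option is to treat out-of-range values as missing. A sketch on a made-up Series `scores`, where the 0 to 100 range is an assumption for illustration only:

```python
import pandas as pd

# Hypothetical scores, two of them outside the assumed valid range.
scores = pd.Series([85.0, -5.0, 95.0, 120.0])

# Assumed valid range for this illustration: 0 to 100 inclusive.
valid = scores.between(0, 100)

# where() keeps values where the condition holds and writes NaN elsewhere.
scores_checked = scores.where(valid)
print(scores_checked)
```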


In [89]:
mean_score = df['score'].mean()

print(f"The mean score is: {mean_score}")

# Fill the missing scores with the column mean
df.fillna({"score": mean_score}, inplace=True)

The mean score is: 70.55555555555556


In [90]:
df

Unnamed: 0,student_id,name,gender,score,study_hours
0,1,Alice,F,85.0,10
1,2,Bob,M,92.0,8
2,3,Charlie,M,78.0,12
3,4,Diana,F,70.555556,15
4,5,Ethan,M,65.0,5
5,6,Fiona,F,95.0,11
6,7,George,M,5.0,4
7,5,Ethan,M,65.0,5
8,8,Hannah,F,88.0,9
9,9,Ian,M,62.0,-2
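
The mean is pulled towards extreme values such as the score of 5, so the median is sometimes preferred as a fill value. A small comparison on a made-up Series `scores` (hypothetical data); which statistic suits a real dataset is a judgment call:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing entry and one very low value.
scores = pd.Series([85.0, 92.0, np.nan, 5.0])

# mean() and median() both skip NaN by default.
mean_fill = scores.fillna(scores.mean())
median_fill = scores.fillna(scores.median())

print(mean_fill.iloc[2], median_fill.iloc[2])
```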


In [91]:
a= df.duplicated()

In [92]:

print(df)

    student_id     name gender      score  study_hours
0            1    Alice      F  85.000000           10
1            2      Bob      M  92.000000            8
2            3  Charlie      M  78.000000           12
3            4    Diana      F  70.555556           15
4            5    Ethan      M  65.000000            5
5            6    Fiona      F  95.000000           11
6            7   George      M   5.000000            4
7            5    Ethan      M  65.000000            5
8            8   Hannah      F  88.000000            9
9            9      Ian      M  62.000000           -2
10          10     Jane      F  70.555556           14


In [93]:
df=df.drop_duplicates()
df

Unnamed: 0,student_id,name,gender,score,study_hours
0,1,Alice,F,85.0,10
1,2,Bob,M,92.0,8
2,3,Charlie,M,78.0,12
3,4,Diana,F,70.555556,15
4,5,Ethan,M,65.0,5
5,6,Fiona,F,95.0,11
6,7,George,M,5.0,4
8,8,Hannah,F,88.0,9
9,9,Ian,M,62.0,-2
10,10,Jane,F,70.555556,14
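
Before dropping duplicates it can help to look at them, and afterwards the index has a gap where the removed row used to be. A minimal sketch on a made-up frame `demo` (hypothetical data):

```python
import pandas as pd

# Hypothetical data in which the Ethan row appears twice.
demo = pd.DataFrame({
    "student_id": [4, 5, 7, 5],
    "name": ["Diana", "Ethan", "George", "Ethan"],
    "score": [70.0, 65.0, 5.0, 65.0],
})

# duplicated() marks a row True when an identical row appeared earlier.
print(demo.duplicated().sum())

# keep=False marks every member of a duplicate group, which helps when inspecting them.
print(demo[demo.duplicated(keep=False)])

# reset_index(drop=True) renumbers the rows so the index has no gap.
deduped = demo.drop_duplicates().reset_index(drop=True)
print(deduped)
```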


In [94]:
print(df)

    student_id     name gender      score  study_hours
0            1    Alice      F  85.000000           10
1            2      Bob      M  92.000000            8
2            3  Charlie      M  78.000000           12
3            4    Diana      F  70.555556           15
4            5    Ethan      M  65.000000            5
5            6    Fiona      F  95.000000           11
6            7   George      M   5.000000            4
8            8   Hannah      F  88.000000            9
9            9      Ian      M  62.000000           -2
10          10     Jane      F  70.555556           14


In [95]:
type(df)

pandas.core.frame.DataFrame

In [96]:
display(df)

Unnamed: 0,student_id,name,gender,score,study_hours
0,1,Alice,F,85.0,10
1,2,Bob,M,92.0,8
2,3,Charlie,M,78.0,12
3,4,Diana,F,70.555556,15
4,5,Ethan,M,65.0,5
5,6,Fiona,F,95.0,11
6,7,George,M,5.0,4
8,8,Hannah,F,88.0,9
9,9,Ian,M,62.0,-2
10,10,Jane,F,70.555556,14


In [97]:
# assign() returns a new DataFrame, so no value is set on a slice of
# another frame and pandas has no reason to emit SettingWithCopyWarning
df = df.assign(score=df['score'].round(2))
df

Unnamed: 0,student_id,name,gender,score,study_hours
0,1,Alice,F,85.0,10
1,2,Bob,M,92.0,8
2,3,Charlie,M,78.0,12
3,4,Diana,F,70.56,15
4,5,Ethan,M,65.0,5
5,6,Fiona,F,95.0,11
6,7,George,M,5.0,4
8,8,Hannah,F,88.0,9
9,9,Ian,M,62.0,-2
10,10,Jane,F,70.56,14


In [98]:
# assign() returns a new DataFrame, so no value is set on a slice of
# another frame and pandas has no reason to emit SettingWithCopyWarning
df = df.assign(study_hours=df['study_hours'].abs())
df

Unnamed: 0,student_id,name,gender,score,study_hours
0,1,Alice,F,85.0,10
1,2,Bob,M,92.0,8
2,3,Charlie,M,78.0,12
3,4,Diana,F,70.56,15
4,5,Ethan,M,65.0,5
5,6,Fiona,F,95.0,11
6,7,George,M,5.0,4
8,8,Hannah,F,88.0,9
9,9,Ian,M,62.0,2
10,10,Jane,F,70.56,14


In [99]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 10
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   student_id   10 non-null     int64  
 1   name         10 non-null     object 
 2   gender       10 non-null     object 
 3   score        10 non-null     float64
 4   study_hours  10 non-null     int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 480.0+ bytes


In [100]:
df

Unnamed: 0,student_id,name,gender,score,study_hours
0,1,Alice,F,85.0,10
1,2,Bob,M,92.0,8
2,3,Charlie,M,78.0,12
3,4,Diana,F,70.56,15
4,5,Ethan,M,65.0,5
5,6,Fiona,F,95.0,11
6,7,George,M,5.0,4
8,8,Hannah,F,88.0,9
9,9,Ian,M,62.0,2
10,10,Jane,F,70.56,14


In [101]:
mean_score1 =df['score'].mean()
print(mean_score1)

71.112


In [102]:
df.to_csv('student_performance_cleaned.csv',index= False)
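
A quick way to check an export is to read it back and compare it with the frame in memory. A sketch on a made-up frame `demo`, writing to an in-memory buffer (an assumption made here so the example leaves no file on disk):

```python
import io

import pandas as pd

# Hypothetical cleaned data.
demo = pd.DataFrame({"student_id": [1, 2], "score": [85.0, 70.56]})

# index=False leaves the row labels out of the CSV, as in the cell above.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Rewind the buffer and read the text back into a new frame.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.equals(demo))
```

Without `index=False`, the row labels would be written as an extra unnamed column and show up as `Unnamed: 0` when the file is read again.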

In [103]:
corr_df = df[['score','study_hours']]
corr_df.corr()

Unnamed: 0,score,study_hours
score,1.0,0.474153
study_hours,0.474153,1.0
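
By default `corr()` reports the Pearson correlation coefficient. To make the number less of a black box, the sketch below computes it by hand on a made-up frame `demo` (hypothetical data) and compares the result with pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
demo = pd.DataFrame({
    "score": [85.0, 92.0, 78.0, 65.0],
    "study_hours": [10, 8, 12, 5],
})

# Pearson correlation as computed by pandas.
r_pandas = demo["score"].corr(demo["study_hours"])

# The same quantity from its definition: the sum of products of the
# centred values, divided by the product of their root sums of squares.
x = demo["score"] - demo["score"].mean()
y = demo["study_hours"] - demo["study_hours"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(r_pandas, r_manual)
```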


In [2]:
import pandas as pd

df = pd.read_csv('laptopData.csv')
df

Unnamed: 0.1,Unnamed: 0,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price
0,0.0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,71378.6832
1,1.0,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,47895.5232
2,2.0,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,30636.0000
3,3.0,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,135195.3360
4,4.0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,96095.8080
...,...,...,...,...,...,...,...,...,...,...,...,...
1298,1298.0,Lenovo,2 in 1 Convertible,14,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows 10,1.8kg,33992.6400
1299,1299.0,Lenovo,2 in 1 Convertible,13.3,IPS Panel Quad HD+ / Touchscreen 3200x1800,Intel Core i7 6500U 2.5GHz,16GB,512GB SSD,Intel HD Graphics 520,Windows 10,1.3kg,79866.7200
1300,1300.0,Lenovo,Notebook,14,1366x768,Intel Celeron Dual Core N3050 1.6GHz,2GB,64GB Flash Storage,Intel HD Graphics,Windows 10,1.5kg,12201.1200
1301,1301.0,HP,Notebook,15.6,1366x768,Intel Core i7 6500U 2.5GHz,6GB,1TB HDD,AMD Radeon R5 M330,Windows 10,2.19kg,40705.9200


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        1273 non-null   float64
 1   Company           1273 non-null   object 
 2   TypeName          1273 non-null   object 
 3   Inches            1273 non-null   object 
 4   ScreenResolution  1273 non-null   object 
 5   Cpu               1273 non-null   object 
 6   Ram               1273 non-null   object 
 7   Memory            1273 non-null   object 
 8   Gpu               1273 non-null   object 
 9   OpSys             1273 non-null   object 
 10  Weight            1273 non-null   object 
 11  Price             1273 non-null   float64
dtypes: float64(2), object(10)
memory usage: 122.3+ KB


In [4]:
df.isnull().sum()

Unnamed: 0          30
Company             30
TypeName            30
Inches              30
ScreenResolution    30
Cpu                 30
Ram                 30
Memory              30
Gpu                 30
OpSys               30
Weight              30
Price               30
dtype: int64
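
Every column reports the same number of missing values, which suggests (but does not prove) that the same rows are blank across the board. A sketch on a made-up frame `demo` (hypothetical data) of checking for fully empty rows and dropping only those:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one complete row, one fully blank row, one partly filled row.
demo = pd.DataFrame({
    "Company": ["Apple", np.nan, "HP"],
    "Ram": ["8GB", np.nan, np.nan],
})

# True for rows in which every column is missing.
fully_blank = demo.isnull().all(axis=1)
print(fully_blank.sum())

# how="all" drops only the fully blank rows.
without_blank_rows = demo.dropna(how="all")

# The default (how="any") drops every row that has at least one missing value.
without_any_missing = demo.dropna()
```

If the count of fully blank rows equals the per-column count of missing values, `dropna()` and `dropna(how="all")` remove the same rows.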

In [5]:
df = df.dropna()

In [6]:
df

Unnamed: 0.1,Unnamed: 0,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price
0,0.0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,71378.6832
1,1.0,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,47895.5232
2,2.0,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,30636.0000
3,3.0,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,135195.3360
4,4.0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,96095.8080
...,...,...,...,...,...,...,...,...,...,...,...,...
1298,1298.0,Lenovo,2 in 1 Convertible,14,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows 10,1.8kg,33992.6400
1299,1299.0,Lenovo,2 in 1 Convertible,13.3,IPS Panel Quad HD+ / Touchscreen 3200x1800,Intel Core i7 6500U 2.5GHz,16GB,512GB SSD,Intel HD Graphics 520,Windows 10,1.3kg,79866.7200
1300,1300.0,Lenovo,Notebook,14,1366x768,Intel Celeron Dual Core N3050 1.6GHz,2GB,64GB Flash Storage,Intel HD Graphics,Windows 10,1.5kg,12201.1200
1301,1301.0,HP,Notebook,15.6,1366x768,Intel Core i7 6500U 2.5GHz,6GB,1TB HDD,AMD Radeon R5 M330,Windows 10,2.19kg,40705.9200


In [7]:
df.isnull().sum()

Unnamed: 0          0
Company             0
TypeName            0
Inches              0
ScreenResolution    0
Cpu                 0
Ram                 0
Memory              0
Gpu                 0
OpSys               0
Weight              0
Price               0
dtype: int64

In [8]:
# rename() returns a new DataFrame; assigning it back avoids the
# SettingWithCopyWarning that inplace=True can raise on a derived frame
df = df.rename(columns={"Unnamed: 0": "S.N."})
df


Unnamed: 0,S.N.,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price
0,0.0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,71378.6832
1,1.0,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,47895.5232
2,2.0,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,30636.0000
3,3.0,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,135195.3360
4,4.0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,96095.8080
...,...,...,...,...,...,...,...,...,...,...,...,...
1298,1298.0,Lenovo,2 in 1 Convertible,14,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows 10,1.8kg,33992.6400
1299,1299.0,Lenovo,2 in 1 Convertible,13.3,IPS Panel Quad HD+ / Touchscreen 3200x1800,Intel Core i7 6500U 2.5GHz,16GB,512GB SSD,Intel HD Graphics 520,Windows 10,1.3kg,79866.7200
1300,1300.0,Lenovo,Notebook,14,1366x768,Intel Celeron Dual Core N3050 1.6GHz,2GB,64GB Flash Storage,Intel HD Graphics,Windows 10,1.5kg,12201.1200
1301,1301.0,HP,Notebook,15.6,1366x768,Intel Core i7 6500U 2.5GHz,6GB,1TB HDD,AMD Radeon R5 M330,Windows 10,2.19kg,40705.9200


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1273 entries, 0 to 1302
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   S.N.              1273 non-null   float64
 1   Company           1273 non-null   object 
 2   TypeName          1273 non-null   object 
 3   Inches            1273 non-null   object 
 4   ScreenResolution  1273 non-null   object 
 5   Cpu               1273 non-null   object 
 6   Ram               1273 non-null   object 
 7   Memory            1273 non-null   object 
 8   Gpu               1273 non-null   object 
 9   OpSys             1273 non-null   object 
 10  Weight            1273 non-null   object 
 11  Price             1273 non-null   float64
dtypes: float64(2), object(10)
memory usage: 129.3+ KB


In [10]:
# DataFrame.astype with a column-to-dtype mapping returns a new DataFrame,
# which avoids the SettingWithCopyWarning raised by column assignment
df = df.astype({"S.N.": int})
df


Unnamed: 0,S.N.,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price
0,0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,71378.6832
1,1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,47895.5232
2,2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,30636.0000
3,3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,135195.3360
4,4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,96095.8080
...,...,...,...,...,...,...,...,...,...,...,...,...
1298,1298,Lenovo,2 in 1 Convertible,14,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows 10,1.8kg,33992.6400
1299,1299,Lenovo,2 in 1 Convertible,13.3,IPS Panel Quad HD+ / Touchscreen 3200x1800,Intel Core i7 6500U 2.5GHz,16GB,512GB SSD,Intel HD Graphics 520,Windows 10,1.3kg,79866.7200
1300,1300,Lenovo,Notebook,14,1366x768,Intel Celeron Dual Core N3050 1.6GHz,2GB,64GB Flash Storage,Intel HD Graphics,Windows 10,1.5kg,12201.1200
1301,1301,HP,Notebook,15.6,1366x768,Intel Core i7 6500U 2.5GHz,6GB,1TB HDD,AMD Radeon R5 M330,Windows 10,2.19kg,40705.9200


In [11]:
df =df.set_index("S.N.")
df

S.N.,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price
0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,71378.6832
1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,47895.5232
2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,30636.0000
3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,135195.3360
4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,96095.8080
...,...,...,...,...,...,...,...,...,...,...,...
1298,Lenovo,2 in 1 Convertible,14,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows 10,1.8kg,33992.6400
1299,Lenovo,2 in 1 Convertible,13.3,IPS Panel Quad HD+ / Touchscreen 3200x1800,Intel Core i7 6500U 2.5GHz,16GB,512GB SSD,Intel HD Graphics 520,Windows 10,1.3kg,79866.7200
1300,Lenovo,Notebook,14,1366x768,Intel Celeron Dual Core N3050 1.6GHz,2GB,64GB Flash Storage,Intel HD Graphics,Windows 10,1.5kg,12201.1200
1301,HP,Notebook,15.6,1366x768,Intel Core i7 6500U 2.5GHz,6GB,1TB HDD,AMD Radeon R5 M330,Windows 10,2.19kg,40705.9200


In [12]:
df.shape


(1273, 11)

In [13]:
df.describe()

Unnamed: 0,Price
count,1273.0
mean,59955.814073
std,37332.251005
min,9270.72
25%,31914.72
50%,52161.12
75%,79333.3872
max,324954.72


In [14]:
df.rename(columns={"Weight": "Weight(in Kg)"}, inplace=True)
df

S.N.,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight(in Kg),Price
0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,71378.6832
1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,47895.5232
2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,30636.0000
3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,135195.3360
4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,96095.8080
...,...,...,...,...,...,...,...,...,...,...,...
1298,Lenovo,2 in 1 Convertible,14,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows 10,1.8kg,33992.6400
1299,Lenovo,2 in 1 Convertible,13.3,IPS Panel Quad HD+ / Touchscreen 3200x1800,Intel Core i7 6500U 2.5GHz,16GB,512GB SSD,Intel HD Graphics 520,Windows 10,1.3kg,79866.7200
1300,Lenovo,Notebook,14,1366x768,Intel Celeron Dual Core N3050 1.6GHz,2GB,64GB Flash Storage,Intel HD Graphics,Windows 10,1.5kg,12201.1200
1301,HP,Notebook,15.6,1366x768,Intel Core i7 6500U 2.5GHz,6GB,1TB HDD,AMD Radeon R5 M330,Windows 10,2.19kg,40705.9200


In [15]:
df["Weight(in Kg)"]=df["Weight(in Kg)"].str.replace("kg","",regex=False) #removing kg
df

S.N.,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight(in Kg),Price
0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37,71378.6832
1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34,47895.5232
2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86,30636.0000
3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83,135195.3360
4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37,96095.8080
...,...,...,...,...,...,...,...,...,...,...,...
1298,Lenovo,2 in 1 Convertible,14,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows 10,1.8,33992.6400
1299,Lenovo,2 in 1 Convertible,13.3,IPS Panel Quad HD+ / Touchscreen 3200x1800,Intel Core i7 6500U 2.5GHz,16GB,512GB SSD,Intel HD Graphics 520,Windows 10,1.3,79866.7200
1300,Lenovo,Notebook,14,1366x768,Intel Celeron Dual Core N3050 1.6GHz,2GB,64GB Flash Storage,Intel HD Graphics,Windows 10,1.5,12201.1200
1301,HP,Notebook,15.6,1366x768,Intel Core i7 6500U 2.5GHz,6GB,1TB HDD,AMD Radeon R5 M330,Windows 10,2.19,40705.9200


In [16]:
df.describe()

Unnamed: 0,Price
count,1273.0
mean,59955.814073
std,37332.251005
min,9270.72
25%,31914.72
50%,52161.12
75%,79333.3872
max,324954.72
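
`describe()` lists only `Price` because, by default, it summarises numeric columns and the weight is still stored as text. A sketch on a made-up frame `demo` (hypothetical data, with clean values so the conversion succeeds):

```python
import pandas as pd

# Hypothetical data: Weight is stored as text, Price as numbers.
demo = pd.DataFrame({
    "Weight": ["1.37", "1.8", "2.19"],
    "Price": [71378.68, 33992.64, 40705.92],
})

# While Weight is text, only Price is summarised.
before = demo.describe().columns.tolist()

# After conversion to float, Weight appears in the summary as well.
demo["Weight"] = demo["Weight"].astype(float)
after = demo.describe().columns.tolist()

print(before, after)
```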


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1273 entries, 0 to 1302
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Company           1273 non-null   object 
 1   TypeName          1273 non-null   object 
 2   Inches            1273 non-null   object 
 3   ScreenResolution  1273 non-null   object 
 4   Cpu               1273 non-null   object 
 5   Ram               1273 non-null   object 
 6   Memory            1273 non-null   object 
 7   Gpu               1273 non-null   object 
 8   OpSys             1273 non-null   object 
 9   Weight(in Kg)     1273 non-null   object 
 10  Price             1273 non-null   float64
dtypes: float64(1), object(10)
memory usage: 114.4+ KB


In [19]:
df["Weight(in Kg)"]=df["Weight(in Kg)"].astype(str)
df

S.N.,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight(in Kg),Price
0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37,71378.6832
1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34,47895.5232
2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86,30636.0000
3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83,135195.3360
4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37,96095.8080
...,...,...,...,...,...,...,...,...,...,...,...
1298,Lenovo,2 in 1 Convertible,14,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows 10,1.8,33992.6400
1299,Lenovo,2 in 1 Convertible,13.3,IPS Panel Quad HD+ / Touchscreen 3200x1800,Intel Core i7 6500U 2.5GHz,16GB,512GB SSD,Intel HD Graphics 520,Windows 10,1.3,79866.7200
1300,Lenovo,Notebook,14,1366x768,Intel Celeron Dual Core N3050 1.6GHz,2GB,64GB Flash Storage,Intel HD Graphics,Windows 10,1.5,12201.1200
1301,HP,Notebook,15.6,1366x768,Intel Core i7 6500U 2.5GHz,6GB,1TB HDD,AMD Radeon R5 M330,Windows 10,2.19,40705.9200


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1273 entries, 0 to 1302
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Company           1273 non-null   object 
 1   TypeName          1273 non-null   object 
 2   Inches            1273 non-null   object 
 3   ScreenResolution  1273 non-null   object 
 4   Cpu               1273 non-null   object 
 5   Ram               1273 non-null   object 
 6   Memory            1273 non-null   object 
 7   Gpu               1273 non-null   object 
 8   OpSys             1273 non-null   object 
 9   Weight(in Kg)     1273 non-null   object 
 10  Price             1273 non-null   float64
dtypes: float64(1), object(10)
memory usage: 114.4+ KB


In [23]:
df["Weight(in Kg)"]=df["Weight(in Kg)"].astype(float)
df

ValueError: could not convert string to float: '?'

In [24]:
df["Inches"]=df["Inches"].astype(float)
df

ValueError: could not convert string to float: '?'
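
The error names one bad value, but there may be others. A sketch on a made-up Series `inches` (hypothetical data) of listing every entry that fails to parse, using `pd.to_numeric` with `errors="coerce"`:

```python
import pandas as pd

# Hypothetical column containing one placeholder value.
inches = pd.Series(["13.3", "15.6", "?", "14"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
as_number = pd.to_numeric(inches, errors="coerce")

# Values that were present but failed to parse are the ones to investigate.
bad_values = inches[as_number.isna() & inches.notna()]
print(bad_values.tolist())
```

Once the offending values are known, we can decide whether to drop those rows, fill them in, or keep the coerced NaN.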

In [27]:
a=df['Weight(in Kg)'].value_counts()
a

Weight(in Kg)
2.2     111
2.1      57
2.4      43
2.3      41
2.5      37
       ... 
1.41      1
3.6       1
4.7       1
4.33      1
4.0       1
Name: count, Length: 189, dtype: int64

In [28]:
print(a)

Weight(in Kg)
2.2     111
2.1      57
2.4      43
2.3      41
2.5      37
       ... 
1.41      1
3.6       1
4.7       1
4.33      1
4.0       1
Name: count, Length: 189, dtype: int64


In [29]:
df["Weight(in Kg)"]=df["Weight(in Kg)"].str.replace("?","",regex=False) # replace the "?" placeholder with an empty string
df

S.N.,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight(in Kg),Price
0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37,71378.6832
1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34,47895.5232
2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86,30636.0000
3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83,135195.3360
4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37,96095.8080
...,...,...,...,...,...,...,...,...,...,...,...
1298,Lenovo,2 in 1 Convertible,14,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows 10,1.8,33992.6400
1299,Lenovo,2 in 1 Convertible,13.3,IPS Panel Quad HD+ / Touchscreen 3200x1800,Intel Core i7 6500U 2.5GHz,16GB,512GB SSD,Intel HD Graphics 520,Windows 10,1.3,79866.7200
1300,Lenovo,Notebook,14,1366x768,Intel Celeron Dual Core N3050 1.6GHz,2GB,64GB Flash Storage,Intel HD Graphics,Windows 10,1.5,12201.1200
1301,HP,Notebook,15.6,1366x768,Intel Core i7 6500U 2.5GHz,6GB,1TB HDD,AMD Radeon R5 M330,Windows 10,2.19,40705.9200


In [30]:
df.isnull().sum()

Company             0
TypeName            0
Inches              0
ScreenResolution    0
Cpu                 0
Ram                 0
Memory              0
Gpu                 0
OpSys               0
Weight(in Kg)       0
Price               0
dtype: int64
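
The count is zero even though a value was just blanked out, because an empty string is still a string and `isnull()` does not treat it as missing. A sketch on a made-up Series `weight` (hypothetical data) of turning empty strings into real NaN values before converting:

```python
import numpy as np
import pandas as pd

# Hypothetical column in which one entry is an empty string.
weight = pd.Series(["1.21", "", "2.4"])

# The empty string is not counted as missing.
print(weight.isnull().sum())

# Replacing it with NaN makes the gap visible to isnull() and dropna().
weight_nan = weight.replace("", np.nan)
print(weight_nan.isnull().sum())

# With NaN in place of the empty string, the float conversion succeeds.
weight_float = weight_nan.astype(float)
print(weight_float)
```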

In [31]:
df.iloc[200:210,:]

S.N.,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight(in Kg),Price
207,Dell,Ultrabook,13.3,IPS Panel 4K Ultra HD / Touchscreen 3840x2160,Intel Core i7 8550U 1.8GHz,8GB,256GB SSD,Intel UHD Graphics 620,Windows 10,1.21,103842.72
208,Dell,Ultrabook,13.3,Full HD 1920x1080,Intel Core i7 8550U 1.8GHz,8GB,256GB SSD,Intel UHD Graphics 620,Windows 10,,77202.72
210,Acer,Notebook,15.6,Full HD 1920x1080,Intel Core i7 7700HQ 2.8GHz,8GB,1TB HDD,Nvidia GeForce GTX 1050,Linux,2.4,41505.12
211,Asus,Gaming,17.3,Full HD 1920x1080,Intel Core i7 7700HQ 2.8GHz,16GB,256GB SSD + 1TB HDD,Nvidia GeForce GTX 1050,Windows 10,2.9,74964.96
212,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i3 6006U 2GHz,4GB,500GB HDD,Intel HD Graphics 520,No OS,2.1,18594.72
213,Lenovo,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,2.2,29250.72
214,Huawei,Ultrabook,13.0,IPS Panel Full HD 2160x1440,Intel Core i7 7500U 2.7GHz,8GB,512GB SSD,Intel HD Graphics 620,Windows 10,1.05,79866.72
215,Dell,Ultrabook,13.3,IPS Panel Full HD 1920x1080,Intel Core i7 8550U 1.8GHz,8GB,256GB SSD,AMD Radeon 530,Windows 10,1.4,49650.5664
216,Lenovo,Notebook,17.3,1600x900,Intel Core i5 7200U 2.5GHz,8GB,1TB HDD,Nvidia GeForce GTX 940MX,No OS,2.8,31381.92
217,HP,Notebook,14.0,Full HD 1920x1080,Intel Core i7 8550U 1.8GHz,8GB,256GB SSD,Nvidia GeForce 930MX,Windows 10,1.63,54931.68


In [33]:
df.isnull().sum()

Company             0
TypeName            0
Inches              0
ScreenResolution    0
Cpu                 0
Ram                 0
Memory              0
Gpu                 0
OpSys               0
Weight(in Kg)       0
Price               0
dtype: int64

In [34]:
df.drop(208)  # returns a new DataFrame; df itself is unchanged unless the result is assigned back

S.N.,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight(in Kg),Price
0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37,71378.6832
1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34,47895.5232
2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86,30636.0000
3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83,135195.3360
4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37,96095.8080
...,...,...,...,...,...,...,...,...,...,...,...
1298,Lenovo,2 in 1 Convertible,14,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows 10,1.8,33992.6400
1299,Lenovo,2 in 1 Convertible,13.3,IPS Panel Quad HD+ / Touchscreen 3200x1800,Intel Core i7 6500U 2.5GHz,16GB,512GB SSD,Intel HD Graphics 520,Windows 10,1.3,79866.7200
1300,Lenovo,Notebook,14,1366x768,Intel Celeron Dual Core N3050 1.6GHz,2GB,64GB Flash Storage,Intel HD Graphics,Windows 10,1.5,12201.1200
1301,HP,Notebook,15.6,1366x768,Intel Core i7 6500U 2.5GHz,6GB,1TB HDD,AMD Radeon R5 M330,Windows 10,2.19,40705.9200


In [35]:
df["Weight(in Kg)"]=df["Weight(in Kg)"].astype(float)
df

ValueError: could not convert string to float: ''
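
The conversion still fails because `drop()` returns a new DataFrame and the result was never stored, so the row with the empty weight is still in `df`. A sketch of the difference on a made-up frame `demo` (hypothetical data with the same index labels as the rows shown earlier):

```python
import pandas as pd

# Hypothetical data: the row labelled 208 has an empty weight.
demo = pd.DataFrame(
    {"Weight": ["1.21", "", "2.4"]},
    index=pd.Index([207, 208, 210], name="S.N."),
)

# The result is discarded here, so demo still has three rows.
demo.drop(208)
still_there = 208 in demo.index
print(still_there)

# Assigning the result back makes the removal stick.
demo = demo.drop(208)

# With the empty string gone, the conversion to float succeeds.
demo["Weight"] = demo["Weight"].astype(float)
print(demo)
```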