# Cleaning and Preparing Data with Pandas

## Introduction

In this project, we demonstrate data cleaning and preparation techniques with Pandas: handling missing values, converting data types, filtering rows, and saving the cleaned dataset.

## Steps

1. Load the Dataset
2. Understand the Data
3. Handle Missing Values
4. Convert Data Types
5. Filter Data
6. Save the Cleaned Data

### Environment Setup

In [None]:
import pandas as pd

## 1. Load the Dataset

We start by loading the Seattle weather dataset from its CSV file with Pandas and displaying the first few rows to understand its structure.

In [12]:
data = pd.read_csv(r'seattle-weather.csv')
data.head()

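|   | date | precipitation | temp_max | temp_min | wind | weather |
|---|------|---------------|----------|----------|------|---------|
| 0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle |
| 1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain |
| 2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain |
| 3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain |
| 4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain |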
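As a possible alternative, dates can be parsed while the file is being read. The sketch below assumes the column layout shown above and inlines a few of the displayed rows through `io.StringIO` so it runs without the CSV file; the names `sample_csv` and `sample` are illustrative only.

```python
import io

import pandas as pd

# A few rows copied from the output above, inlined so the sketch
# does not depend on seattle-weather.csv being present.
sample_csv = io.StringIO(
    "date,precipitation,temp_max,temp_min,wind,weather\n"
    "2012-01-01,0.0,12.8,5.0,4.7,drizzle\n"
    "2012-01-02,10.9,10.6,2.8,4.5,rain\n"
    "2012-01-03,0.8,11.7,7.2,2.3,rain\n"
)

# parse_dates asks read_csv to convert the listed columns to datetime
# during loading instead of leaving them as strings.
sample = pd.read_csv(sample_csv, parse_dates=["date"])
print(sample.dtypes)
```

Parsing at load time would make a later `pd.to_datetime` call unnecessary, at the cost of hiding the conversion step.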


## 2. Understand the Data

We take a quick look at the dataset to understand its structure and data types, and to identify obvious issues such as missing values.

In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1461 entries, 0 to 1460
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           1461 non-null   object 
 1   precipitation  1461 non-null   float64
 2   temp_max       1461 non-null   float64
 3   temp_min       1461 non-null   float64
 4   wind           1461 non-null   float64
 5   weather        1461 non-null   object 
dtypes: float64(4), object(2)
memory usage: 68.6+ KB


In [4]:
data.describe(include = 'all')

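|   | date | precipitation | temp_max | temp_min | wind | weather |
|---|------|---------------|----------|----------|------|---------|
| count | 1461 | 1461.0 | 1461.0 | 1461.0 | 1461.0 | 1461 |
| unique | 1461 |  |  |  |  | 5 |
| top | 2012-01-01 |  |  |  |  | rain |
| freq | 1 |  |  |  |  | 641 |
| mean |  | 3.029432 | 16.439083 | 8.234771 | 3.241136 |  |
| std |  | 6.680194 | 7.349758 | 5.023004 | 1.437825 |  |
| min |  | 0.0 | -1.6 | -7.1 | 0.4 |  |
| 25% |  | 0.0 | 10.6 | 4.4 | 2.2 |  |
| 50% |  | 0.0 | 15.6 | 8.3 | 3.0 |  |
| 75% |  | 2.8 | 22.2 | 12.2 | 4.0 |  |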


In [5]:
data.isnull().sum()

date             0
precipitation    0
temp_max         0
temp_min         0
wind             0
weather          0
dtype: int64
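When a dataset does contain gaps, the share of missing values per column is often easier to judge than raw counts. The following is a minimal sketch on a hypothetical toy frame (`toy` is not part of the Seattle data, which has no missing values).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps, for illustration only.
toy = pd.DataFrame({
    "temp_max": [12.8, np.nan, 11.7, 12.2],
    "wind": [4.7, 4.5, np.nan, np.nan],
})

# isnull() yields booleans, so mean() gives the fraction of
# missing entries in each column.
missing_share = toy.isnull().mean()
print(missing_share)
```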

## 3. Handle Missing Values

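Missing values can be handled either by dropping the affected rows or by filling them with appropriate values, so that incomplete data does not distort the analysis. The check in the previous step found no missing values in this dataset, so the fillna(0) call here leaves the data unchanged and is included for illustration.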

In [14]:
data_filled = data.fillna(0)

print("Data after filling missing values with 0:\n")
data_filled.head()

Data after filling missing values with 0:



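|   | date | precipitation | temp_max | temp_min | wind | weather |
|---|------|---------------|----------|----------|------|---------|
| 0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle |
| 1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain |
| 2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain |
| 3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain |
| 4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain |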
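Filling with 0 is only one option, and for a temperature column it may distort statistics. The sketch below compares three common strategies on a hypothetical toy frame; the column values and the "unknown" placeholder are assumptions chosen for illustration, not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per column.
toy = pd.DataFrame({
    "temp_max": [10.0, np.nan, 14.0],
    "weather": ["rain", "sun", None],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill each column with its own replacement value
# (median for the numeric column, a placeholder label for the text column).
filled = toy.fillna({
    "temp_max": toy["temp_max"].median(),
    "weather": "unknown",
})

# Option 3: linearly interpolate a numeric series between its neighbours.
interpolated = toy["temp_max"].interpolate()

print(dropped)
print(filled)
print(interpolated)
```

Which option is suitable depends on the column: interpolation tends to make sense for ordered time series such as daily temperatures, while a placeholder label is usually safer for categorical text.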


## 4. Convert Data Types

We convert columns to appropriate data types. The 'date' column, which was loaded as object (string), is converted to datetime format. The numerical columns are already float64, so the astype(float) calls leave them unchanged and simply make the intended types explicit.

In [15]:
if 'date' in data.columns:
    data['date'] = pd.to_datetime(data['date'])

data['precipitation'] = data['precipitation'].astype(float)
data['temp_max'] = data['temp_max'].astype(float)
data['temp_min'] = data['temp_min'].astype(float)
data['wind'] = data['wind'].astype(float)
print("Data types after conversion:")
data.dtypes

Data types after conversion:


date             datetime64[ns]
precipitation           float64
temp_max                float64
temp_min                float64
wind                    float64
weather                  object
dtype: object
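If a numeric column contains stray text, `astype(float)` raises an error. A more tolerant conversion can be sketched with `pd.to_numeric(..., errors="coerce")`, which turns unparseable entries into NaN. The toy frame and the "n/a" entry below are hypothetical; the example also shows the optional `category` dtype for a low-cardinality column such as 'weather'.

```python
import pandas as pd

# Hypothetical toy frame: numbers stored as strings, with one bad entry.
toy = pd.DataFrame({
    "precipitation": ["0.0", "10.9", "n/a"],
    "weather": ["drizzle", "rain", "rain"],
})

# errors="coerce" replaces values that cannot be parsed with NaN
# instead of raising an exception.
toy["precipitation"] = pd.to_numeric(toy["precipitation"], errors="coerce")

# A column with few distinct labels can be stored as a categorical.
toy["weather"] = toy["weather"].astype("category")

print(toy.dtypes)
```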

## 5. Filter Data

We filter the dataset to include only relevant rows. In this case, we filter the data to include only records from 2013 onwards.

In [16]:
data_filtered = data[data['date'] >= '2013-01-01']
print("Data after filtering by date:")
data_filtered.head()

Data after filtering by date:


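|   | date | precipitation | temp_max | temp_min | wind | weather |
|---|------|---------------|----------|----------|------|---------|
| 366 | 2013-01-01 | 0.0 | 5.0 | -2.8 | 2.7 | sun |
| 367 | 2013-01-02 | 0.0 | 6.1 | -1.1 | 3.2 | sun |
| 368 | 2013-01-03 | 4.1 | 6.7 | -1.7 | 3.0 | rain |
| 369 | 2013-01-04 | 2.5 | 10.0 | 2.2 | 2.8 | rain |
| 370 | 2013-01-05 | 3.0 | 6.7 | 4.4 | 3.1 | rain |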
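Filters can also be combined. The sketch below, on a hypothetical toy frame, builds one boolean mask for a date range with `Series.between` and combines it with a second condition using `&`; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame spanning three calendar years.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2012-12-31", "2013-01-01", "2013-06-15", "2014-01-01"]
    ),
    "weather": ["rain", "sun", "rain", "rain"],
})

# between() is inclusive on both ends by default.
in_2013 = toy["date"].between("2013-01-01", "2013-12-31")

# Combine masks with & (and) or | (or); each comparison needs its own
# parentheses because of operator precedence.
rainy_2013 = toy[in_2013 & (toy["weather"] == "rain")]
print(rainy_2013)
```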


## 6. Save the Cleaned Data

Finally, we save the cleaned dataset to a new CSV file. This cleaned data will be used in subsequent analysis and visualizations.

In [17]:
clean_file = r'cleaned_seattle-weather.csv'
data_filtered.to_csv(clean_file, index=False)
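CSV files do not store data types, so a datetime column comes back as plain text unless it is parsed again on load. The following sketch, which writes a hypothetical toy frame to a temporary directory, shows one way to check that a saved file round-trips as expected.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the cleaned data.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-02"]),
    "temp_max": [5.0, 6.1],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    # index=False keeps the row index out of the file.
    toy.to_csv(path, index=False)
    # Re-parse the date column, since the CSV only holds text.
    reloaded = pd.read_csv(path, parse_dates=["date"])

print(reloaded.dtypes)
```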