# Data Wrangling with Pandas
In this notebook, we work with housing data from Ames, IA. The dataset describes houses in Ames, including their sale prices and a range of other features. We will perform data wrangling tasks such as cleaning, transforming, and summarizing the data using the Pandas library in Python.

Before we start, we import the necessary libraries and load the dataset.
## 3 Importing Libraries
We will use the NumPy and pandas libraries for data manipulation and analysis.

In [51]:
import numpy as np
import pandas as pd


pd.set_option('display.max_columns', 100)  # display up to 100 columns when viewing a DataFrame


Next we will load the dataset using pandas.
## 4 Import Data


In [52]:
df_realestate = pd.read_csv('data/Real Estate Data.csv',
                            index_col=0, # index_col=0 to use the first column as index
                            header=0)
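
To see what `index_col=0` does without needing the real file, here is a minimal sketch that reads a tiny, made-up CSV from memory. The `sample_csv` and `df_sample` names and the three rows are illustrative assumptions, not part of the dataset.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real file
sample_csv = io.StringIO(
    "Id,Type,Sale Price\n"
    "1,2-STORY 1946 & NEWER,208500\n"
    "2,1-STORY 1946 & NEWER,181500\n"
    "4,2-STORY 1945 & OLDER,140000\n"
)

# index_col=0 turns the first column (Id) into the row index
df_sample = pd.read_csv(sample_csv, index_col=0, header=0)

print(df_sample.index.name)     # name of the index
print(list(df_sample.columns))  # Id is no longer a regular column
```

Because `Id` becomes the index, rows are then looked up by label, for example `df_sample.loc[4]`, rather than by position.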

## 5 Viewing Data
After loading the dataset, we will take a look at the first few and last few rows to understand its structure and contents.
- **View the top 5 rows of `df_realestate`**


In [53]:
df_realestate.head()

Id,Type,Zoning Class,Lot Frontage,Lot Area,Alley,Lot Shape,Land Contour,Lot Config,Land Slope,Nbhd,Location Condition,Bldg Type,House Style,OvQual,Overall Cond,Built,Year Remod Add,Roof Style,Roof Material,Exterior Primary,Masonry/Veneer,Masonry/Veneer Area,Exterior Qual,Exterior Cond,Foundation,Basement Height,Basement Cond,Basement Exposure,Basement Finish,Basement Finished Area,Basement Unfinished Area,Basement Area,Heating Qual,CentralAir,Electrical,1st Floor Area,2nd Floor Area,Living Area Above Grade,Basement Full Baths,Basement Half baths,Full Baths Above Grade,Half Baths Above Grade,Bedrooms Above Grade,Kitchens Above Grade,Kitchen Qual,Total Rooms Above Grade,Functionality,Fireplaces,Fireplce Qual,Garage Type,Garage Yr Built,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck Area,Open Porch Area,Enclosed Porch Area,3 Season Porch Area,Screen Porch Area,Pool Area,Pool Qual,Fence,Sale Type,Sale Condition,Sale Price
1,2-STORY 1946 & NEWER,Resid Low Density,65.0,8450,,Regular,Level,Inside lot,Gentle,College Creek,Normal,1-family Detached,2 story,7,5,2003,2003,Gable,Composite Shingle,Vinyl Siding,Brick Face,196.0,Good,Average,Poured Contrete,"Good (90-99"")",Average,No Exposure,Good Living Quarters,706,150,856,Excellent,Y,Standard Circuit Breakers & Romex,856,854,1710,1,0,2,1,3,1,Good,8,Typical Functionality,0,No Fireplace,Attached to home,2003.0,Rough Finished,2,548,Average,Average,Paved,0,61,0,0,0,0,No Pool,No Fence,Warranty Deed - Conventional,Normal Sale,208500
2,1-STORY 1946 & NEWER,Resid Low Density,80.0,9600,,Regular,Level,Frontage on 2 sides,Gentle,Veenker,Adjacent Feeder St,1-family Detached,1 story,6,8,1976,1976,Gable,Composite Shingle,Metal Siding,,0.0,Average,Average,Cinder Block,"Good (90-99"")",Average,Good Exposure,Avg Living Quarters,978,284,1262,Excellent,Y,Standard Circuit Breakers & Romex,1262,0,1262,0,1,2,0,3,1,Average,6,Typical Functionality,1,Average,Attached to home,1976.0,Rough Finished,2,460,Average,Average,Paved,298,0,0,0,0,0,No Pool,No Fence,Warranty Deed - Conventional,Normal Sale,181500
3,2-STORY 1946 & NEWER,Resid Low Density,68.0,11250,,Slightly irregular,Level,Inside lot,Gentle,College Creek,Normal,1-family Detached,2 story,7,5,2001,2002,Gable,Composite Shingle,Vinyl Siding,Brick Face,162.0,Good,Average,Poured Contrete,"Good (90-99"")",Average,Min Exposure,Good Living Quarters,486,434,920,Excellent,Y,Standard Circuit Breakers & Romex,920,866,1786,1,0,2,1,3,1,Good,6,Typical Functionality,1,Average,Attached to home,2001.0,Rough Finished,2,608,Average,Average,Paved,0,42,0,0,0,0,No Pool,No Fence,Warranty Deed - Conventional,Normal Sale,223500
4,2-STORY 1945 & OLDER,Resid Low Density,60.0,9550,,Slightly irregular,Level,Corner lot,Gentle,Crawford,Normal,1-family Detached,2 story,7,5,1915,1970,Gable,Composite Shingle,Wood Siding,,0.0,Average,Average,Brick & Tile,"Typical (80-89"")",Good,No Exposure,Avg Living Quarters,216,540,756,Good,Y,Standard Circuit Breakers & Romex,961,756,1717,1,0,1,0,3,1,Good,7,Typical Functionality,1,Good,Detached from home,1998.0,Unfinished,3,642,Average,Average,Paved,0,35,272,0,0,0,No Pool,No Fence,Warranty Deed - Conventional,"Abnormal Sale - trade, foreclosure, short sale",140000
5,2-STORY 1946 & NEWER,Resid Low Density,84.0,14260,,Slightly irregular,Level,Frontage on 2 sides,Gentle,Northridge,Normal,1-family Detached,2 story,8,5,2000,2000,Gable,Composite Shingle,Vinyl Siding,Brick Face,350.0,Good,Average,Poured Contrete,"Good (90-99"")",Average,Avg Exposure,Good Living Quarters,655,490,1145,Excellent,Y,Standard Circuit Breakers & Romex,1145,1053,2198,1,0,2,1,4,1,Good,9,Typical Functionality,1,Average,Attached to home,2000.0,Rough Finished,3,836,Average,Average,Paved,192,84,0,0,0,0,No Pool,No Fence,Warranty Deed - Conventional,Normal Sale,250000


- **View a sample of records of `df_realestate`**

In [54]:
df_realestate.sample(5) # Unlike head(), sample() will return random rows from the dataframe

Id,Type,Zoning Class,Lot Frontage,Lot Area,Alley,Lot Shape,Land Contour,Lot Config,Land Slope,Nbhd,Location Condition,Bldg Type,House Style,OvQual,Overall Cond,Built,Year Remod Add,Roof Style,Roof Material,Exterior Primary,Masonry/Veneer,Masonry/Veneer Area,Exterior Qual,Exterior Cond,Foundation,Basement Height,Basement Cond,Basement Exposure,Basement Finish,Basement Finished Area,Basement Unfinished Area,Basement Area,Heating Qual,CentralAir,Electrical,1st Floor Area,2nd Floor Area,Living Area Above Grade,Basement Full Baths,Basement Half baths,Full Baths Above Grade,Half Baths Above Grade,Bedrooms Above Grade,Kitchens Above Grade,Kitchen Qual,Total Rooms Above Grade,Functionality,Fireplaces,Fireplce Qual,Garage Type,Garage Yr Built,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck Area,Open Porch Area,Enclosed Porch Area,3 Season Porch Area,Screen Porch Area,Pool Area,Pool Qual,Fence,Sale Type,Sale Condition,Sale Price
26,1-STORY 1946 & NEWER,Resid Low Density,110.0,14230,,Regular,Level,Corner lot,Gentle,Northridge Heights,Normal,1-family Detached,1 story,8,5,2007,2007,Gable,Composite Shingle,Vinyl Siding,Stone,640.0,Good,Average,Poured Contrete,"Good (90-99"")",Average,No Exposure,Unfinshed,0,1566,1566,Excellent,Y,Standard Circuit Breakers & Romex,1600,0,1600,0,0,2,0,3,1,Good,7,Typical Functionality,1,Good,Attached to home,2007.0,Rough Finished,3,890,Average,Average,Paved,0,56,0,0,0,0,No Pool,No Fence,Warranty Deed - Conventional,Normal Sale,256300
1280,1-1/2 STORY ALL AGES,Commercial,60.0,7500,,Regular,Level,Inside lot,Gentle,Iowa DOT and Rail Road,Normal,1-family Detached,1.5 story: 2nd level fin,4,4,1920,1950,Gable,Composite Shingle,Metal Siding,,0.0,Average,Good,Cinder Block,"Typical (80-89"")",Average,No Exposure,Unfinshed,0,698,698,Average,Y,Fuse Box over 60 AMP and all Romex wiring (Ave...,698,430,1128,0,0,1,0,2,1,Average,6,Typical Functionality,0,No Fireplace,Detached from home,1980.0,Rough Finished,2,528,Average,Average,Paved,30,0,164,0,0,0,No Pool,No Fence,Court Officer Deed/Estate,"Abnormal Sale - trade, foreclosure, short sale",68400
752,2-STORY 1946 & NEWER,Resid Low Density,,7750,,Regular,Level,Inside lot,Gentle,Gilbert,Adjacent Railroad,1-family Detached,2 story,7,5,2003,2003,Gable,Composite Shingle,Vinyl Siding,,0.0,Good,Average,Poured Contrete,"Good (90-99"")",Average,No Exposure,Unfinshed,0,660,660,Excellent,Y,Standard Circuit Breakers & Romex,660,660,1320,0,0,2,1,3,1,Good,6,Typical Functionality,0,No Fireplace,Attached to home,2003.0,Finished,2,400,Average,Average,Paved,0,48,0,0,0,0,No Pool,No Fence,Warranty Deed - Conventional,Normal Sale,162000
576,1-1/2 STORY ALL AGES,Resid Low Density,80.0,8480,,Regular,Level,Inside lot,Gentle,North Ames,Normal,1-family Detached,1.5 story: 2nd level fin,5,5,1947,1950,Gable,Composite Shingle,Metal Siding,,0.0,Average,Average,Cinder Block,"Typical (80-89"")",Average,No Exposure,Avg Rec Room,442,390,832,Average,Y,Standard Circuit Breakers & Romex,832,384,1216,0,0,1,0,2,1,Average,6,Typical Functionality,0,No Fireplace,Detached from home,1947.0,Unfinished,1,336,Average,Average,Paved,158,0,102,0,0,0,No Pool,No Fence,Court Officer Deed/Estate,"Abnormal Sale - trade, foreclosure, short sale",118500
365,2-STORY 1946 & NEWER,Resid Low Density,,18800,,Slightly irregular,Level,Frontage on 2 sides,Gentle,Northwest Ames,Normal,1-family Detached,2 story,6,5,1976,1976,Gable,Composite Shingle,Hard Board,Brick Face,120.0,Average,Average,Poured Contrete,"Good (90-99"")",Average,Min Exposure,Good Living Quarters,712,84,796,Average,Y,Standard Circuit Breakers & Romex,790,784,1574,1,0,2,1,3,1,Average,6,Typical Functionality,1,Average,Attached to home,1976.0,Finished,2,566,Average,Average,Paved,306,111,0,0,0,0,No Pool,No Fence,Warranty Deed - Conventional,Normal Sale,190000


- **View the bottom 2 rows of `df_realestate`**

In [55]:
# by default, tail() returns the last 5 rows of the dataframe
df_realestate.tail(n=2)

Id,Type,Zoning Class,Lot Frontage,Lot Area,Alley,Lot Shape,Land Contour,Lot Config,Land Slope,Nbhd,Location Condition,Bldg Type,House Style,OvQual,Overall Cond,Built,Year Remod Add,Roof Style,Roof Material,Exterior Primary,Masonry/Veneer,Masonry/Veneer Area,Exterior Qual,Exterior Cond,Foundation,Basement Height,Basement Cond,Basement Exposure,Basement Finish,Basement Finished Area,Basement Unfinished Area,Basement Area,Heating Qual,CentralAir,Electrical,1st Floor Area,2nd Floor Area,Living Area Above Grade,Basement Full Baths,Basement Half baths,Full Baths Above Grade,Half Baths Above Grade,Bedrooms Above Grade,Kitchens Above Grade,Kitchen Qual,Total Rooms Above Grade,Functionality,Fireplaces,Fireplce Qual,Garage Type,Garage Yr Built,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck Area,Open Porch Area,Enclosed Porch Area,3 Season Porch Area,Screen Porch Area,Pool Area,Pool Qual,Fence,Sale Type,Sale Condition,Sale Price
1459,1-STORY 1946 & NEWER,Resid Low Density,68.0,9717,,Regular,Level,Inside lot,Gentle,North Ames,Normal,1-family Detached,1 story,5,6,1950,1996,Hip,Composite Shingle,Metal Siding,,0.0,Average,Average,Cinder Block,"Typical (80-89"")",Average,Min Exposure,Good Living Quarters,49,0,1078,Good,Y,Fuse Box over 60 AMP and all Romex wiring (Ave...,1078,0,1078,1,0,1,0,2,1,Good,5,Typical Functionality,0,No Fireplace,Attached to home,1950.0,Unfinished,1,240,Average,Average,Paved,366,0,112,0,0,0,No Pool,No Fence,Warranty Deed - Conventional,Normal Sale,142125
1460,1-STORY 1946 & NEWER,Resid Low Density,75.0,9937,,Regular,Level,Inside lot,Gentle,Edwards,Normal,1-family Detached,1 story,5,6,1965,1965,Gable,Composite Shingle,Hard Board,,0.0,Good,Average,Cinder Block,"Typical (80-89"")",Average,No Exposure,Below Avg Living Quarters,830,136,1256,Good,Y,Standard Circuit Breakers & Romex,1256,0,1256,1,0,1,1,3,1,Average,6,Typical Functionality,0,No Fireplace,Attached to home,1965.0,Finished,1,276,Average,Average,Paved,736,68,0,0,0,0,No Pool,No Fence,Warranty Deed - Conventional,Normal Sale,147500


- **View the info for `df_realestate`**

In [56]:
df_realestate.info() # gives a summary of the dataframe

<class 'pandas.core.frame.DataFrame'>
Index: 1404 entries, 1 to 1460
Data columns (total 68 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Type                      1404 non-null   object 
 1   Zoning Class              1404 non-null   object 
 2   Lot Frontage              1151 non-null   float64
 3   Lot Area                  1404 non-null   int64  
 4   Alley                     84 non-null     object 
 5   Lot Shape                 1404 non-null   object 
 6   Land Contour              1404 non-null   object 
 7   Lot Config                1404 non-null   object 
 8   Land Slope                1404 non-null   object 
 9   Nbhd                      1404 non-null   object 
 10  Location Condition        1404 non-null   object 
 11  Bldg Type                 1404 non-null   object 
 12  House Style               1404 non-null   object 
 13  OvQual                    1404 non-null   int64  
 14  Overall Cond 

To summarize what we did in this section:
- We used the `head()` method to view the first 5 rows of the DataFrame.
- We used the `tail()` method to view the last 2 rows of the DataFrame.
- We used the `sample()` method to view a random sample of 5 rows from the DataFrame.
- We used the `info()` method to get a summary of the DataFrame, including the number of non-null values and data types of each column.
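
Since `sample()` returns different rows on each run, it can help to fix the seed when a notebook needs to be reproducible. The sketch below uses a small made-up frame (`df_demo` and its values are placeholders, not the real data) to show `head()`, `tail()`, and a seeded `sample()`.

```python
import pandas as pd

# Small made-up frame; the values are placeholders
df_demo = pd.DataFrame(
    {"Sale Price": [208500, 181500, 223500, 140000, 250000, 143000]},
    index=pd.Index([1, 2, 3, 4, 5, 6], name="Id"),
)

first_two = df_demo.head(2)   # first 2 rows
last_two = df_demo.tail(n=2)  # last 2 rows

# random_state fixes the seed, so the same rows come back on every run
picked = df_demo.sample(3, random_state=0)
picked_again = df_demo.sample(3, random_state=0)

print(first_two.index.tolist(), last_two.index.tolist())
print(picked.equals(picked_again))
```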

## 6 Drop and replace columns
In this section, we will drop unnecessary columns from the DataFrame and replace some values in the columns with more meaningful ones. We will also rename some columns for better readability.

Below is the list of columns to drop:

- Zoning Class
- Lot Shape
- Lot Config
- Land Slope
- Bldg Type
- House Style
- Roof Style
- Roof Material
- Exterior Primary
- Masonry/Veneer

- Exterior Qual
- Exterior Cond
- Foundation
- Basement Height
- Basement Cond
- Basement Exposure
- Basement Finish
- Heating Qual
- CentralAir
- Electrical

- Functionality
- Fireplce Qual
- Garage Type
- Garage Qual
- Garage Cond
- Paved Drive
- Pool Qual
- Fence
- Sale Type
- Year Remod Add

In [57]:
print("Dimensions of the dataframe before dropping columns: ", df_realestate.shape)
# Drop columns that are not needed for analysis
# Create a list of columns to drop

columns_to_drop = [
    'Zoning Class', 'Lot Shape', 'Lot Config', 'Land Slope', 'Bldg Type', 'House Style',
    'Roof Style', 'Roof Material', 'Exterior Primary', 'Masonry/Veneer', 'Exterior Qual',
    'Exterior Cond', 'Foundation', 'Basement Height', 'Basement Cond', 'Basement Exposure',
    'Basement Finish', 'Heating Qual', 'CentralAir', 'Electrical', 'Functionality',
    'Fireplce Qual', 'Garage Type', 'Garage Qual', 'Garage Cond', 'Paved Drive',
    'Pool Qual', 'Fence', 'Sale Type', 'Year Remod Add'
]

df_realestate = df_realestate.drop(columns=columns_to_drop)
print("Dimensions of the dataframe after dropping columns: ", df_realestate.shape)

Dimensions of the dataframe before dropping columns:  (1404, 68)
Dimensions of the dataframe after dropping columns:  (1404, 38)


**Task 1: Drop unnecessary columns from the DataFrame.** Achieved
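
By default, `drop` raises a `KeyError` if any listed column does not exist, so a misspelled name stops the cell. One possible way to check the list first is sketched below on a toy frame; `df_demo`, `wanted`, and the deliberately wrong `'Not A Column'` entry are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with a few of the real column names and placeholder values
df_demo = pd.DataFrame({
    "Type": ["2-STORY 1946 & NEWER"],
    "Zoning Class": ["Resid Low Density"],
    "Fence": ["No Fence"],
    "Sale Price": [208500],
})

# 'Not A Column' is deliberately wrong to show the check
wanted = ["Zoning Class", "Fence", "Not A Column"]
missing = [c for c in wanted if c not in df_demo.columns]
present = [c for c in wanted if c in df_demo.columns]

# Drop only the columns that actually exist
df_trimmed = df_demo.drop(columns=present)

print("Missing:", missing)
print("Remaining:", list(df_trimmed.columns))
```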

Next we will rename the following columns: 

- Type to Dwelling Type
- OvQual to Overall Quality
- Nbhd to Neighborhood
- Built to Year Built

In [58]:
df_realestate = df_realestate.rename(columns={
    'Type': 'Dwelling Type',
    'OvQual': 'Overall Quality',
    'Nbhd': 'Neighborhood',
    'Built': 'Year Built'
})
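
Unlike `drop`, `rename` skips keys that are not existing columns without raising an error (its default is `errors='ignore'`), so a typo in the mapping can go unnoticed. A minimal sketch of a quick check, using a one-row made-up frame (`df_demo`, `rename_map`, and `unknown` are illustrative names):

```python
import pandas as pd

# One-row toy frame with the four original column names
df_demo = pd.DataFrame({
    "Type": ["DUPLEX"],
    "OvQual": [5],
    "Nbhd": ["Edwards"],
    "Built": [1965],
})

rename_map = {
    "Type": "Dwelling Type",
    "OvQual": "Overall Quality",
    "Nbhd": "Neighborhood",
    "Built": "Year Built",
}

# Keys that are not existing columns would be skipped silently, so list them first
unknown = [old for old in rename_map if old not in df_demo.columns]
df_renamed = df_demo.rename(columns=rename_map)

print(unknown)
print(list(df_renamed.columns))
```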

## 7 Grouping the data and replacing values
### Section 7a: Group based on 'Neighborhood' and replace values

In [59]:
# Group by 'Neighborhood' and count occurrences
neighborhood_counts = df_realestate.groupby('Neighborhood')['Neighborhood'].count().reset_index(name='Count')
# reset_index() is used to convert the Series back to a DataFrame
# and name the count column 'Count'

# Display the result (only the Neighborhood and its count)
print(neighborhood_counts[['Neighborhood', 'Count']])

                  Neighborhood  Count
0          Bloomington Heights      6
1              Bloomington Hts     11
2                     Bluestem      2
3                    Briardale     16
4                    Brookside     49
5                  Clear Creek     27
6                College Creek    150
7                     Crawford     50
8                      Edwards     89
9                      Gilbert     79
10      Iowa DOT and Rail Road     34
11              Meadow Village     10
12                    Mitchell     48
13                  North Ames    216
14             Northpark Villa      9
15                  Northridge     41
16          Northridge Heights     77
17              Northwest Ames     73
18                    Old Town    103
19                      Sawyer     74
20                 Sawyer West     58
21                    Somerset     86
22  South & West of Iowa State     24
23                 Stone Brook     25
24                  Timberland     36
25          

In the frequency table above, `Bloomington Heights` and `Bloomington Hts` appear as two separate entries. They refer to the same neighborhood; the difference is only a spelling inconsistency. We will merge them under a single name.

In [60]:
# Correct the inconsistent spelling of Bloomington Heights
neighborhood_corrections = {
    'Bloomington Hts': 'Bloomington Heights'
}
df_realestate['Neighborhood'] = df_realestate['Neighborhood'].replace(neighborhood_corrections)
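
As a side note, `value_counts()` gives the same frequency information as the `groupby(...).count()` pattern in a single call. The sketch below applies it before and after a `replace` on a four-value made-up Series (`s` and its contents are placeholders).

```python
import pandas as pd

# Made-up Series with one inconsistent spelling
s = pd.Series(
    ["Bloomington Heights", "Bloomington Hts", "Veenker", "Bloomington Hts"],
    name="Neighborhood",
)

before = s.value_counts()  # counts per distinct label
fixed = s.replace({"Bloomington Hts": "Bloomington Heights"})
after = fixed.value_counts()

print(before)
print(after)
```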

In [61]:
# checking if the correction worked

neighborhood_counts = df_realestate.groupby('Neighborhood')['Neighborhood'].count().reset_index(name='Count')

print(neighborhood_counts[['Neighborhood', 'Count']])

                  Neighborhood  Count
0          Bloomington Heights     17
1                     Bluestem      2
2                    Briardale     16
3                    Brookside     49
4                  Clear Creek     27
5                College Creek    150
6                     Crawford     50
7                      Edwards     89
8                      Gilbert     79
9       Iowa DOT and Rail Road     34
10              Meadow Village     10
11                    Mitchell     48
12                  North Ames    216
13             Northpark Villa      9
14                  Northridge     41
15          Northridge Heights     77
16              Northwest Ames     73
17                    Old Town    103
18                      Sawyer     74
19                 Sawyer West     58
20                    Somerset     86
21  South & West of Iowa State     24
22                 Stone Brook     25
23                  Timberland     36
24                     Veenker     11


The updated frequency table confirms that the duplicate entry is gone: `Bloomington Heights` now has 17 records (6 + 11).
Next, we show the median sale price for each neighborhood.

In [62]:
# Calculate median sale price per neighborhood (sorted for readability)
median_prices = df_realestate.groupby('Neighborhood')['Sale Price'].median().reset_index(name='Median Sale Price')
median_prices_sorted = median_prices.sort_values(by='Median Sale Price', ascending=False)
# ascending=False sorts the values in descending order,
# so the neighborhoods with the highest median sale prices appear at the top

# Display the result
print(median_prices_sorted)

                  Neighborhood  Median Sale Price
15          Northridge Heights           315000.0
14                  Northridge           301500.0
22                 Stone Brook           278000.0
23                  Timberland           233975.0
20                    Somerset           225500.0
24                     Veenker           218000.0
6                     Crawford           208812.0
4                  Clear Creek           200000.0
5                College Creek           197200.0
0          Bloomington Heights           191000.0
16              Northwest Ames           182900.0
8                      Gilbert           181000.0
19                 Sawyer West           179950.0
11                    Mitchell           154750.0
13             Northpark Villa           146000.0
12                  North Ames           141000.0
21  South & West of Iowa State           139750.0
1                     Bluestem           137500.0
18                      Sawyer           135000.0


From the above output, we can see that the median sale price is highest for Northridge Heights, whereas it is lowest for Iowa DOT and Rail Road.
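
A median over very few sales can be misleading, so it can be useful to report the number of sales next to it. The following is a sketch on made-up numbers (`df_demo`, `median_price`, and `sales` are illustrative names) that uses named aggregation to compute both in one step.

```python
import pandas as pd

# Made-up sales; Edwards contains one unusually high price
df_demo = pd.DataFrame({
    "Neighborhood": ["Veenker", "Veenker", "Edwards", "Edwards", "Edwards"],
    "Sale Price": [200000, 240000, 100000, 120000, 500000],
})

# Named aggregation: one output column per keyword argument
summary = (
    df_demo.groupby("Neighborhood")["Sale Price"]
    .agg(median_price="median", sales="count")
    .sort_values("median_price", ascending=False)
)

print(summary)
```

In this toy data the single high sale in Edwards does not pull its median up, which is one reason the median is a common choice for prices.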
### Section 7b: Group based on ‘Dwelling Type’ and replace values
As in the previous section, we group the data, this time by `Dwelling Type`.

In [63]:
# Group by 'Dwelling Type' and count occurrences
Dwelling_counts = df_realestate.groupby('Dwelling Type')['Dwelling Type'].count().reset_index(name='Count')
# reset_index() is used to convert the Series back to a DataFrame
# and name the count column 'Count'

# Display the result (only the Dwelling Type and its count)
print(Dwelling_counts[['Dwelling Type', 'Count']])

           Dwelling Type  Count
0            1 STORY PUD      9
1   1-1/2 STORY ALL AGES    138
2   1-STORY 1945 & OLDER     62
3   1-STORY 1946 & NEWER    531
4            1-STORY PUD     78
5    2 FAMILY CONVERSION     28
6   2-1/2 STORY ALL AGES     15
7   2-STORY 1945 & OLDER     59
8   2-STORY 1946 & NEWER    298
9            2-STORY PUD     63
10                DUPLEX     46
11           SPLIT FOYER     20
12  SPLIT OR MULTI-LEVEL     57


As in the previous section, the same category appears under two different spellings: `1 STORY PUD` and `1-STORY PUD` are listed separately but describe the same dwelling type.
We will merge them and then show the median sale price for each dwelling type.

In [64]:
# Correct typos for 1 STORY PUD
Dwelling_corrections = {
    '1 STORY PUD': '1-STORY PUD'
}
df_realestate['Dwelling Type'] = df_realestate['Dwelling Type'].replace(Dwelling_corrections)
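
Near-duplicate labels like these were found here by reading the frequency table. As a rough sketch of how they might be flagged automatically, the example below normalizes the labels (lower case, hyphens turned into spaces) and lists any normalized form that has more than one original spelling. The normalization rule and the names `labels`, `normalised`, `variants`, and `suspects` are assumptions for illustration; other datasets may need different rules.

```python
import pandas as pd

# Made-up list of labels, two of which differ only by a hyphen
labels = pd.Series(["1 STORY PUD", "1-STORY PUD", "DUPLEX", "SPLIT FOYER"])

# Normalise: lower-case and turn hyphens into spaces
normalised = labels.str.lower().str.replace("-", " ", regex=False)

# Collect the original spellings behind each normalised form
variants = labels.groupby(normalised).unique()

# Keep only the forms that have more than one spelling
suspects = variants[variants.map(len) > 1]

print(suspects)
```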

In [65]:
median_prices = df_realestate.groupby('Dwelling Type')['Sale Price'].median().reset_index(name='Median Sale Price')
median_prices_sorted = median_prices.sort_values(by='Median Sale Price', ascending=False)
# Display the result
print(median_prices_sorted)

           Dwelling Type  Median Sale Price
7   2-STORY 1946 & NEWER           215600.0
3            1-STORY PUD           192000.0
11  SPLIT OR MULTI-LEVEL           165500.0
5   2-1/2 STORY ALL AGES           164000.0
2   1-STORY 1946 & NEWER           160000.0
6   2-STORY 1945 & OLDER           157000.0
8            2-STORY PUD           146000.0
10           SPLIT FOYER           140750.0
9                 DUPLEX           136702.5
0   1-1/2 STORY ALL AGES           133716.0
4    2 FAMILY CONVERSION           128250.0
1   1-STORY 1945 & OLDER           100000.0


From the above output, we can see that the median sale price is highest for `2-STORY 1946 & NEWER`, whereas it is lowest for `1-STORY 1945 & OLDER`.

## 8 Summarize and Filter Data
### Section 8a: Pivot Neighborhood and Land Contour



In [66]:
# Create pivot table
df_re_pivot = df_realestate.pivot_table(
    index='Neighborhood',
    columns='Land Contour',
    values='Sale Price',  
    aggfunc='median',
    fill_value=None  # Leave NaN values as they are
)

# Display the pivot table
print("Neighborhood vs. Land Contour - Median Sale Price:")
print(df_re_pivot.reset_index())  # reset_index() to make the index a column for better readability

Neighborhood vs. Land Contour - Median Sale Price:
Land Contour                Neighborhood    Banked  Depression  Hillside  \
0                    Bloomington Heights       NaN         NaN       NaN   
1                               Bluestem       NaN         NaN       NaN   
2                              Briardale       NaN         NaN       NaN   
3                              Brookside  207000.0     39300.0   82500.0   
4                            Clear Creek  220250.0    215000.0  186500.0   
5                          College Creek  124900.0    147000.0  124000.0   
6                               Crawford  184250.0    224000.0  204350.0   
7                                Edwards  159500.0     72950.0   94750.0   
8                                Gilbert  154500.0         NaN  239950.0   
9                 Iowa DOT and Rail Road  118400.0     94500.0  102776.0   
10                        Meadow Village       NaN         NaN       NaN   
11                              Mitch
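
To make the `NaN` cells in a pivot table like this easier to interpret, here is a minimal sketch on four made-up sales (`df_demo` and `pivot` are illustrative names). A cell is `NaN` when no row has that combination of neighborhood and land contour.

```python
import pandas as pd

# Four made-up sales; Gilbert has no 'Banked' lot
df_demo = pd.DataFrame({
    "Neighborhood": ["Crawford", "Crawford", "Crawford", "Gilbert"],
    "Land Contour": ["Banked", "Banked", "Level", "Level"],
    "Sale Price": [180000, 190000, 210000, 181000],
})

# Rows: neighborhoods, columns: land contours, cells: median sale price
pivot = df_demo.pivot_table(
    index="Neighborhood",
    columns="Land Contour",
    values="Sale Price",
    aggfunc="median",
)

print(pivot)
```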


### Section 8b: Describe and Filter for Garage Cars

In this section, we describe the `Garage Cars` column, count its non-null values, and then filter the data to keep only rows where `Garage Cars` is less than or equal to 3.

In [67]:
print("Description of the column `Garage Cars` :", df_realestate['Garage Cars'].describe())

Description of the column `Garage Cars` : count    1404.000000
mean        1.797721
std         0.728482
min         0.000000
25%         1.000000
50%         2.000000
75%         2.000000
max         4.000000
Name: Garage Cars, dtype: float64


In [68]:

print(f"Number of non-null values in the 'Garage Cars' column before filtering: {df_realestate['Garage Cars'].count()}")

Number of non-null values in the 'Garage Cars' column before filtering: 1404


In [69]:
df_realestate = df_realestate[df_realestate['Garage Cars'] <= 3]

print(f"Number of non-null values in the 'Garage Cars' column after filtering: {df_realestate['Garage Cars'].count()}")

Number of non-null values in the 'Garage Cars' column after filtering: 1399


Here we treated houses with more than 3 garage cars as outliers and removed them. Only the records with `Garage Cars` less than or equal to 3 (1399 of the 1404 rows) are kept for further analysis.
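
The filter works through a boolean mask: the comparison produces one `True`/`False` value per row, and indexing with that mask keeps only the `True` rows. A minimal sketch on made-up values (`df_demo`, `mask`, `df_kept`, and `removed` are illustrative names):

```python
import pandas as pd

# Made-up garage capacities for five houses
df_demo = pd.DataFrame(
    {"Garage Cars": [0, 2, 3, 4, 1]},
    index=pd.Index([10, 11, 12, 13, 14], name="Id"),
)

mask = df_demo["Garage Cars"] <= 3  # boolean Series, one value per row
df_kept = df_demo[mask]             # keeps only the rows where mask is True
removed = len(df_demo) - len(df_kept)

print(mask.tolist())
print(df_kept.index.tolist(), removed)
```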

In [70]:
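# Drop the 'Garage Area' column; reassigning (instead of inplace=True) matches the
# earlier drop and avoids modifying a filtered slice in place
df_realestate = df_realestate.drop(columns="Garage Area")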

### Section 8c: Describe and Filter for Sale Price

In this section, we describe the `Sale Price` column, then filter the data for sale prices above 500,000, sort the result so that the largest value is at the top, and count the matching records.

In [78]:
df_realestate["Sale Price"].describe()

count      1399.000000
mean     183887.932094
std       79325.966763
min       34900.000000
25%      132500.000000
50%      165500.000000
75%      215100.000000
max      755000.000000
Name: Sale Price, dtype: float64

In [81]:
df_realestate[df_realestate['Sale Price'] > 500000].sort_values(by='Sale Price', ascending=False)

Id,Dwelling Type,Lot Frontage,Lot Area,Alley,Land Contour,Neighborhood,Location Condition,Overall Quality,Overall Cond,Year Built,Masonry/Veneer Area,Basement Finished Area,Basement Unfinished Area,Basement Area,1st Floor Area,2nd Floor Area,Living Area Above Grade,Basement Full Baths,Basement Half baths,Full Baths Above Grade,Half Baths Above Grade,Bedrooms Above Grade,Kitchens Above Grade,Kitchen Qual,Total Rooms Above Grade,Fireplaces,Garage Yr Built,Garage Finish,Garage Cars,Wood Deck Area,Open Porch Area,Enclosed Porch Area,3 Season Porch Area,Screen Porch Area,Pool Area,Sale Condition,Sale Price
692,2-STORY 1946 & NEWER,104.0,21535,,Level,Northridge,Normal,10,6,1994,1170.0,1455,989,2444,2444,1872,4316,0,1,3,1,4,1,Excellent,10,2,1994.0,Finished,3,382,50,0,0,0,0,Normal Sale,755000
1183,2-STORY 1946 & NEWER,160.0,15623,,Level,Northridge,Normal,10,5,1996,0.0,2096,300,2396,2411,2065,4476,1,0,3,1,4,1,Excellent,10,2,1996.0,Finished,3,171,78,0,0,0,555,"Abnormal Sale - trade, foreclosure, short sale",745000
1170,2-STORY 1946 & NEWER,118.0,35760,,Level,Northridge,Normal,10,5,1995,1378.0,1387,543,1930,1831,1796,3627,1,0,3,1,4,1,Good,10,1,1995.0,Finished,3,361,76,0,0,0,0,Normal Sale,625000
899,1-STORY 1946 & NEWER,100.0,12919,,Level,Northridge Heights,Normal,9,5,2009,760.0,2188,142,2330,2364,0,2364,1,0,2,1,2,1,Excellent,11,2,2009.0,Finished,3,0,67,0,0,0,0,Home was not completed when last assessed (ass...,611657
804,2-STORY 1946 & NEWER,107.0,13891,,Level,Northridge Heights,Normal,9,5,2008,424.0,0,1734,1734,1734,1088,2822,0,0,3,1,4,1,Excellent,12,1,2009.0,Rough Finished,3,52,170,0,0,192,0,Home was not completed when last assessed (ass...,582933
1047,2-STORY 1946 & NEWER,85.0,16056,,Level,Stone Brook,Normal,9,5,2005,208.0,240,1752,1992,1992,876,2868,0,0,3,1,4,1,Excellent,11,1,2005.0,Finished,3,214,108,0,0,0,0,Home was not completed when last assessed (ass...,556581
441,1-STORY 1946 & NEWER,105.0,15431,,Level,Northridge Heights,Normal,10,5,2008,200.0,1767,788,3094,2402,0,2402,1,0,2,0,2,1,Excellent,10,2,2008.0,Finished,3,0,72,0,0,170,0,Normal Sale,555000
770,2-STORY 1946 & NEWER,47.0,53504,,Hillside,Stone Brook,Normal,8,5,2003,603.0,1416,234,1650,1690,1589,3279,1,0,3,1,4,1,Excellent,12,1,2003.0,Finished,3,503,36,0,0,210,0,Normal Sale,538000
179,1-STORY 1946 & NEWER,63.0,17423,,Level,Stone Brook,Normal,9,5,2008,748.0,1904,312,2216,2234,0,2234,1,0,2,0,1,1,Excellent,9,1,2009.0,Finished,3,0,60,0,0,0,0,Home was not completed when last assessed (ass...,501837


In [82]:
print(f"Shape of the subset with 'Sale Price' over 500,000: {df_realestate[df_realestate['Sale Price'] > 500000].shape}")


Shape of the subset with 'Sale Price' over 500,000: (9, 37)


In [83]:
df_re_over500K = df_realestate[df_realestate['Sale Price'] > 500000]

In [85]:
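# Select by column name rather than by position (columns 0, 5 and 36)
df_re_over500K[['Dwelling Type', 'Neighborhood', 'Sale Price']]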

Id,Dwelling Type,Neighborhood,Sale Price
179,1-STORY 1946 & NEWER,Stone Brook,501837
441,1-STORY 1946 & NEWER,Northridge Heights,555000
692,2-STORY 1946 & NEWER,Northridge,755000
770,2-STORY 1946 & NEWER,Stone Brook,538000
804,2-STORY 1946 & NEWER,Northridge Heights,582933
899,1-STORY 1946 & NEWER,Northridge Heights,611657
1047,2-STORY 1946 & NEWER,Stone Brook,556581
1170,2-STORY 1946 & NEWER,Northridge,625000
1183,2-STORY 1946 & NEWER,Northridge,745000


Unlike in the previous section, we cannot simply remove these outliers without further analysis. One thing we could do instead is merge the outliers.

**Save `df_realestate` to a CSV file**
Finally, we will save the cleaned and transformed DataFrame to a CSV file for future use.


In [86]:
# Note: index=False leaves the Id index out of the saved file; use index=True to keep it
df_realestate.to_csv('data/Real Estate Data - Week 2.csv', index=False)
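
Because `Id` was set as the index when loading, the `index` argument of `to_csv` decides whether it ends up in the saved file. A small sketch, writing to an in-memory string instead of a file (`df_demo`, `without_index`, `with_index`, and `restored` are illustrative names):

```python
import io
import pandas as pd

# Two made-up rows with Id as the index, as in the notebook
df_demo = pd.DataFrame(
    {"Sale Price": [208500, 181500]},
    index=pd.Index([1, 2], name="Id"),
)

without_index = df_demo.to_csv(index=False)  # Id is not written
with_index = df_demo.to_csv(index=True)      # Id is written as the first column

# Reading the second version back restores Id as the index
restored = pd.read_csv(io.StringIO(with_index), index_col=0)

print(without_index.splitlines()[0])
print(with_index.splitlines()[0])
```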