In [7]:
df.duplicated(subset=None, keep='first')

duplicated_rows_df = df[df.duplicated()]

print(duplicated_rows_df)

Empty DataFrame
Columns: [id, date, price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, sqft_above, sqft_basement, yr_built, yr_renovated, zipcode, lat, long, sqft_living15, sqft_lot15]
Index: []

[0 rows x 21 columns]


This tells us that no row in the dataframe is an exact duplicate of another across all columns, so we don't need to delete any fully duplicated rows.
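
The check above compares all columns at once. A house that was sold more than once would still pass it, because the date and price differ between sales. As a minimal sketch, assuming `id` identifies a house rather than a sale (an assumption, not something the data confirms here), the same method can be restricted to one column. The `toy` frame below holds made-up values for illustration only.

```python
import pandas as pd

# Toy stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "id": [101, 102, 101, 103],
    "date": ["5/1/2014", "6/2/2014", "3/9/2015", "7/4/2014"],
    "price": [300000.0, 450000.0, 340000.0, 520000.0],
})

# No row is an exact copy of another across all columns...
n_full_dupes = int(toy.duplicated().sum())

# ...but the same id can still appear in more than one row.
# keep=False marks every member of a repeated group, not just the later copies.
repeat_ids = toy[toy.duplicated(subset="id", keep=False)]

print(n_full_dupes)
print(repeat_ids)
```

Whether repeated ids should be kept, dropped, or collapsed to the latest sale is a separate cleaning decision.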

**1) What is the relationship between a house's sale price and its area in sq ft?**

In [10]:
df['price'].isna().sum()

0

In [11]:
df['sqft_living'].isna().sum()

0

It's useful to know that neither the price column nor the sqft_living column contains null values, so we don't need to correct for these during data cleaning.

During data cleaning, we may also want to add a new column that assigns each house to a price range, since ranges are a little easier to manipulate than raw prices.
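
One possible way to build such a price-range column is `pd.cut`. This is a sketch only: the band edges and labels below are hypothetical and would need to be chosen after looking at the actual price distribution, and the `toy` prices are made up.

```python
import pandas as pd

# Toy prices for illustration; not taken from the dataset.
toy = pd.DataFrame({"price": [180000.0, 320000.0, 540000.0, 1200000.0]})

# Hypothetical band edges and labels.
edges = [0, 250_000, 500_000, 1_000_000, float("inf")]
labels = ["<250k", "250k-500k", "500k-1M", ">1M"]

# Each interval is closed on the right by default, e.g. (250000, 500000].
toy["price_range"] = pd.cut(toy["price"], bins=edges, labels=labels)

print(toy)
```

If equally populated bands are preferred over fixed edges, `pd.qcut` is the quantile-based counterpart.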

In [13]:
df['waterfront'].isna().sum()

2376

In [14]:
df['waterfront'].value_counts()

0.0    19075
1.0      146
Name: waterfront, dtype: int64


2376 values in the waterfront column are missing. We may need to replace these later with a meaningful value during data cleaning.

Only 146 of the remaining houses have a view of the waterfront, whilst 19075 don't. This seems low given the location of King County, which lies along Puget Sound and includes Seattle's Elliott Bay shoreline.
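
Two simple ways of handling the missing waterfront values are sketched below on a toy column. Filling with 0 rests on the assumption that "not recorded" means "not on the waterfront", which the data does not confirm, so the sketch also keeps a flag marking which rows were filled.

```python
import numpy as np
import pandas as pd

# Toy column for illustration.
toy = pd.DataFrame({"waterfront": [0.0, 1.0, np.nan, 0.0, np.nan]})

# dropna=False keeps the missing values visible in the tally.
counts = toy["waterfront"].value_counts(dropna=False)

# Record which rows were missing before anything is filled.
toy["waterfront_missing"] = toy["waterfront"].isna()

# Assumption: treat a missing value as "not waterfront".
toy["waterfront_filled"] = toy["waterfront"].fillna(0.0)

print(counts)
print(toy)
```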

In [19]:
df['yr_built'].unique()

array([1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2003, 1942,
       1927, 1977, 1900, 1979, 1994, 1916, 1921, 1969, 1947, 1968, 1985,
       1941, 1915, 1909, 1948, 2005, 1929, 1981, 1930, 1904, 1996, 2000,
       1984, 2014, 1922, 1959, 1966, 1953, 1950, 2008, 1991, 1954, 1973,
       1925, 1989, 1972, 1986, 1956, 2002, 1992, 1964, 1952, 1961, 2006,
       1988, 1962, 1939, 1946, 1967, 1975, 1980, 1910, 1983, 1978, 1905,
       1971, 2010, 1945, 1924, 1990, 1914, 1926, 2004, 1923, 2007, 1976,
       1949, 1999, 1901, 1993, 1920, 1997, 1943, 1957, 1940, 1918, 1928,
       1974, 1911, 1936, 1937, 1982, 1908, 1931, 1998, 1913, 2013, 1907,
       1958, 2012, 1912, 2011, 1917, 1932, 1944, 1902, 2009, 1903, 1970,
       2015, 1934, 1938, 1919, 1906, 1935])

In [20]:
df['yr_built'].isna().sum()

0

In [None]:
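# Unfinished draft: the lambda is never assigned or applied.
lambda x: x - df_with_sales_year['yr_built']

Note to self: no lambda is needed here. Column subtraction is already vectorised by NumPy, so this cell can simply be deleted.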
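
A sketch of the vectorised alternative: subtracting one column from another works element-wise, so a house's age at sale needs neither a lambda nor `apply`. The column name `sale_year` is hypothetical and the values are made up.

```python
import pandas as pd

# Toy stand-in for df_with_sales_year; 'sale_year' is a hypothetical column name.
toy = pd.DataFrame({
    "sale_year": [2014, 2015, 2014],
    "yr_built": [1955, 2001, 2014],
})

# Column arithmetic is applied row by row automatically.
toy["age_at_sale"] = toy["sale_year"] - toy["yr_built"]

print(toy)
```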

**Date**

In [None]:
df['date'].isna().sum()

Entries in the date column (the date each house was sold) are in month-day-year format. It may be useful to reduce these dates to the year only.
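
A minimal sketch of extracting the sale year, assuming the dates are stored as strings separated by slashes, such as '10/13/2014'. The separator is an assumption and the format string should be adjusted to match the real data. `sale_year` is a hypothetical column name.

```python
import pandas as pd

# Toy dates in month/day/year order; the '/' separator is an assumption.
toy = pd.DataFrame({"date": ["10/13/2014", "2/25/2015", "12/9/2014"]})

# An explicit format avoids any ambiguity between day-first and month-first dates.
parsed = pd.to_datetime(toy["date"], format="%m/%d/%Y")

# The .dt accessor exposes the components of each datetime.
toy["sale_year"] = parsed.dt.year

print(toy)
```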

**Bedrooms**

At first glance, there seems to be one clear outlier in the bedrooms column: one house has 33 bedrooms. One would expect this house to be very big and extremely expensive. However, looking at that particular row shows that its price is not at the higher end of the range of house prices (it is only USD 100,000 more than the average house price). We can reasonably treat this value as an anomaly.

In [None]:
df.loc[df['bedrooms'] == 33]

In [None]:
df['bedrooms'].isna().sum()

In [None]:
df['bedrooms'].value_counts()
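
One way to act on this is to inspect the suspect row alongside a typical price and then exclude it. The sketch below uses a made-up toy frame; dropping the row is only one option, and correcting the value would be another.

```python
import pandas as pd

# Toy frame for illustration; values are made up.
toy = pd.DataFrame({
    "bedrooms": [3, 4, 33, 2],
    "sqft_living": [1800, 2400, 1600, 900],
    "price": [450000.0, 600000.0, 640000.0, 250000.0],
})

# Look at the suspect row next to the overall median before deciding.
suspect = toy.loc[toy["bedrooms"] == 33, ["bedrooms", "sqft_living", "price"]]
median_price = toy["price"].median()

# Option: drop the row. The value 33 is specific to this one anomaly.
cleaned = toy[toy["bedrooms"] != 33].reset_index(drop=True)

print(suspect)
print(median_price)
print(cleaned)
```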

**Bathrooms**

The values in the bathrooms column are read here as the number of bathrooms per bedroom in each house. Under that reading, we could calculate the number of bathrooms in a house by adding a new column to the dataframe that multiplies the bathrooms value by the number of bedrooms.

In [None]:
df['bathrooms'].isna().sum()

In [None]:
df['bathrooms'].value_counts()

**Grade**

The 'grade' column doesn't include null values.
The 'grade' column is categorical: each house is assigned a grade for the overall condition of the housing unit under the King County grading system.
However, we can't yet be sure whether a higher grade indicates a better overall condition.


In [None]:
df['grade'].isna().sum()

In [None]:
df['grade'].value_counts()
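
A quick way to test whether a higher grade goes with a higher price is to compare the median price per grade. This is a sketch on made-up values. A rising median would be consistent with "higher grade is better", although it would not prove what the grade measures.

```python
import pandas as pd

# Toy frame for illustration; values are made up.
toy = pd.DataFrame({
    "grade": [6, 6, 7, 7, 9, 9],
    "price": [250000.0, 270000.0, 400000.0, 420000.0, 800000.0, 900000.0],
})

# Median price for each grade, ordered by grade.
median_by_grade = toy.groupby("grade")["price"].median().sort_index()

# True if the median price never falls as the grade rises.
rises_with_grade = bool(median_by_grade.is_monotonic_increasing)

print(median_by_grade)
print(rises_with_grade)
```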

**Floors**

Some of the values in the floors column seem a little strange: the highest number of floors in a house is 3.5, which suggests that some houses may have mezzanines in them.

**View**

The view column tells us whether the property has been viewed or not. However, the values included for this factor are not straightforward.

In [None]:
df['view'].isna().sum()

In [None]:
df['view'].value_counts()

It could be that the column tells us how many times a property has been viewed. We might expect a property with a high number of views to be less desirable and harder to sell.

**Condition**

This tells us how good the condition of the house is overall. We can assume that the higher the associated value, the better the condition the house is in.

In [None]:
df['condition'].value_counts()

In [None]:
df['condition'].isna().sum()

In [None]:
df_price_condition = df[['price', 'condition']]

df_price_condition.head()

**Sqft above** describes the total area of the house, minus the area of the basement. We could assume that the larger the area of the house, the higher the price of the house.

In [None]:
df['sqft_above'].isna().sum()

**Sqft_basement** describes the total area of the basement of a house. Not all houses have basements, and we will assume that those without one have a value of '0' in this column. We can see below that some values for this factor are missing and are recorded as '?'. Because '?' is a placeholder string rather than a null, isna() will not count these values.

In [None]:
df['sqft_basement'].isna().sum()

In [None]:
df['sqft_basement'].value_counts()
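
A sketch of how the '?' placeholders could be counted and converted, assuming the column was loaded as strings (which is what a '?' entry implies). `sqft_basement_num` is a hypothetical name for the cleaned column.

```python
import pandas as pd

# Toy column stored as strings, with '?' standing in for missing values.
toy = pd.DataFrame({"sqft_basement": ["0.0", "400.0", "?", "0.0", "?"]})

# isna() does not count '?', because it is an ordinary string.
n_isna = int(toy["sqft_basement"].isna().sum())
n_placeholder = int((toy["sqft_basement"] == "?").sum())

# errors='coerce' turns anything that cannot be parsed (here '?') into NaN.
toy["sqft_basement_num"] = pd.to_numeric(toy["sqft_basement"], errors="coerce")

print(n_isna, n_placeholder)
print(toy)
```

Once the column is numeric, the missing entries show up in `isna()` and can be handled like any other null.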

**Yr_built** describes the year a house was built. We don't yet know whether/how this factor will influence the price of a house - perhaps a house built a long time ago will have some value, as long as its condition is not too low or if it has been recently renovated.

In [None]:
df['yr_built'].value_counts()

In [None]:
df['yr_built'].isna().sum()

**Yr_renovated** tells us the year a house was renovated, if at all. We can assume that '0' means the house has never been renovated.

In [None]:
df['yr_renovated'].value_counts()

In [None]:
df['yr_renovated'].isna().sum()
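
Under the assumption that 0 means "never renovated", the column can be reduced to a simple flag. This is a sketch with made-up values, and `renovated` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Toy column for illustration; one missing value is included on purpose.
toy = pd.DataFrame({"yr_renovated": [0.0, 1991.0, np.nan, 0.0, 2005.0]})

# Assumption: 0 means the house was never renovated.
# A comparison against NaN is False, so missing values are also flagged False here.
toy["renovated"] = toy["yr_renovated"] > 0

print(toy)
```

If missing values should stay distinguishable from "never renovated", they would need their own flag before this step.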

**Zipcode** The zipcode of a property tells us where the property is located. Some zipcodes will be more in demand than others because they are more attractive areas to live in. To work with this factor, we may need to find a way to group the zipcodes together by larger area, in order to make our data more manageable (we could do this using the **longitude** and **latitude** figures we are given, for example).

In [None]:
df['zipcode'].value_counts()

In [None]:
df['zipcode'].isna().sum()
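
One rough way to group zipcodes into larger areas is to take the average latitude and longitude of each zipcode and assign it to a coarse grid cell. The sketch below is only an illustration: the cut points (47.5 and -122.2), the area labels, and the toy values are all hypothetical, and a real grouping would need cut points chosen from a map or from the data.

```python
import pandas as pd

# Illustrative values, not rows from the dataset.
toy = pd.DataFrame({
    "zipcode": [98101, 98101, 98052, 98052, 98003],
    "lat": [47.61, 47.62, 47.68, 47.67, 47.31],
    "long": [-122.33, -122.34, -122.12, -122.11, -122.31],
})

# One centroid (mean position) per zipcode.
centroids = toy.groupby("zipcode")[["lat", "long"]].mean()

# Hypothetical coarse grid: one arbitrary split per axis.
ns = pd.cut(centroids["lat"], bins=[-90, 47.5, 90], labels=["south", "north"])
ew = pd.cut(centroids["long"], bins=[-180, -122.2, 180], labels=["west", "east"])
centroids["area"] = ns.astype(str) + "-" + ew.astype(str)

# Map each house to the area of its zipcode.
toy["area"] = toy["zipcode"].map(centroids["area"])

print(toy)
```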

**Sqft_living15** This factor tells us the square footage of the living space of the nearest 15 neighbours of each house, whilst **sqft_lot15** tells us the size of the plot of land of the nearest 15 neighbours of each house.