## Introduction
In this course, we started by building intuition for model-based learning, explored how the linear regression model works, compared the two approaches to model fitting, and covered some techniques for cleaning, transforming, and selecting features. In this guided project, you can practice what you learned by exploring ways to improve the models we built.

You'll work with housing data for the city of Ames, Iowa, United States from 2006 to 2010. You can read more about why the data was collected here. You can also read about the different columns in the data here.

Let's start by setting up a pipeline of functions that will let us quickly iterate on different models.

`train` ----> `transform_features()` ----> `select_features()` ----> `train_and_test()` ----> `rmse`, `avg_rmse`

## Instructions

* Import pandas, matplotlib, and numpy into the environment. Import the classes you need from scikit-learn as well.
* Read `AmesHousing.tsv` into a pandas data frame.
* For the following functions, we recommend creating them in the first few cells in the notebook. This way, you can add cells to the end of the notebook to do experiments and update the functions in these cells.
  * Create a function named `transform_features()` that, for now, just returns the `train` data frame.
  * Create a function named `select_features()` that, for now, just returns the `Gr Liv Area` and `SalePrice` columns from the train data frame.
  * Create a function named `train_and_test()` that, for now:
    * Selects the first 1460 rows from the data and assigns them to `train`.
    * Selects the remaining rows from the data and assigns them to `test`.
    * Trains a model using all numerical columns except the `SalePrice` column (the target column) from the data frame returned from `select_features()`.
    * Tests the model on the test set and returns the `RMSE` value.


In [1]:
#Import pandas, matplotlib, and numpy into the environment.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

#Read AmesHousing.tsv into a pandas data frame.
df=pd.read_csv('AmesHousing.tsv',delimiter='\t')

In [2]:
#Create a function named transform_features() that, for now, just returns the train data frame.
def transform_features(df):
    return df

#Create a function named select_features() that, for now, just returns the Gr Liv Area and 
#SalePrice columns from the train data frame.
def select_features(df):
    return df[["Gr Liv Area", "SalePrice"]]


#Create a function named train_and_test()
def train_and_test(df):
    train_df=df[:1460]
    test_df=df[1460:]
    ## You can use `pd.DataFrame.select_dtypes()` to specify column types
    ## and return only those columns as a data frame.
    numeric_train = train_df.select_dtypes(include=['integer', 'float'])
    numeric_test = test_df.select_dtypes(include=['integer', 'float'])
    features=numeric_train.columns.drop(['SalePrice'])
    lr = LinearRegression()
    lr.fit(train_df[features], train_df["SalePrice"])
    predictions = lr.predict(test_df[features])
    mse = mean_squared_error(test_df["SalePrice"], predictions)
    rmse = np.sqrt(mse)
    return rmse



transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df)

rmse



57088.25161263909

## Feature Engineering
Let's now start removing features with many missing values, diving deeper into potential categorical features, and transforming text and numerical columns. Update `transform_features()` so that any column from the data frame with more than 25% (or another cutoff value) missing values is dropped. You also need to remove any columns that leak information about the sale (e.g. the year the sale happened). In general, the goal of this function is to:

* remove features that we don't want to use in the model, just based on the number of missing values or data leakage
* transform features into the proper format (numerical to categorical, scaling numerical, filling in missing values, etc)
* create new features by combining other features
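
As a minimal sketch of the first goal, the cutoff can be made a parameter so that different values are easy to try. The helper name `drop_sparse_columns` and the toy data frame are assumptions made for illustration; they are not part of the project files.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, cutoff=0.25):
    """Return a copy of `frame` without columns whose missing share exceeds `cutoff`."""
    # isnull().mean() gives the fraction of missing values per column
    missing_share = frame.isnull().mean()
    keep = missing_share[missing_share <= cutoff].index
    return frame[keep].copy()

# Toy data (hypothetical), not the Ames data set
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "some_missing": [1.0, np.nan, 3.0, 4.0],          # 25% missing
    "complete": [1, 2, 3, 4],                         # 0% missing
})

trimmed = drop_sparse_columns(toy, cutoff=0.25)
print(list(trimmed.columns))
```

Because the helper returns a copy, the original frame stays available for further experiments.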

Next, you need to get more familiar with the remaining columns by reading the data documentation for each column, determining what transformations are necessary (if any), and more. As we mentioned earlier, succeeding in predictive modeling (and competitions like Kaggle) is highly dependent on the quality of features the model has. Libraries like scikit-learn have made it quick and easy to simply try and tweak many different models, but cleaning, selecting, and transforming features are still more of an art that requires a bit of human ingenuity.

## Instructions

* As we mentioned earlier, we recommend adding some cells to explore and experiment with different features (before rewriting these functions).
* The `transform_features()` function shouldn't modify the train data frame and instead return a new one entirely. This way, we can keep using train in the experimentation cells.
* Which columns contain less than 5% missing values?
  * For numerical columns that meet this criteria, let's fill in the missing values using the most popular value for that column.
* What new features can we create, that better capture the information in some of the features?
  * An example of this would be the `years_until_remod` feature we created in the last lesson.
* Which columns need to be dropped for other reasons?
* Which columns aren't useful for machine learning?
* Which columns leak data about the final sale?
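
One possible way to fill lightly missing numerical columns with their most common value, without modifying the input frame, is sketched below. The helper name `fill_numeric_with_mode`, its `max_missing_share` parameter, and the toy column are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def fill_numeric_with_mode(frame, max_missing_share=0.05):
    """Return a new frame where lightly missing numeric columns are filled with their mode."""
    numeric = frame.select_dtypes(include="number")
    missing_share = numeric.isnull().mean()
    # Only columns with some, but less than `max_missing_share`, missing values
    fixable = missing_share[(missing_share > 0) & (missing_share < max_missing_share)].index
    modes = {col: frame[col].mode().iloc[0] for col in fixable}
    # fillna() returns a new frame, so the caller's frame is left untouched
    return frame.fillna(modes)

# Toy data (hypothetical): 1 missing value out of 40 rows, i.e. 2.5% missing
toy = pd.DataFrame({"garage_cars": [2.0] * 30 + [1.0] * 9 + [np.nan]})
filled = fill_numeric_with_mode(toy)

print(filled["garage_cars"].isnull().sum(), toy["garage_cars"].isnull().sum())
```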


In [3]:
percent_df=df.isnull().sum() * 100/len(df)
#drop columns with missing values greater than 5%
missing_val_greater_5=percent_df[(percent_df>5)]
df=df.drop(missing_val_greater_5.index,axis=1)

In [4]:
#select the columns with some, but less than 5%, missing values
missing_val_less_5=df[percent_df[(percent_df>0) & (percent_df<5)].index]

#For numerical columns that meet this criteria, let's fill in the missing 
#values using the most popular value for that column.
numeric_df=missing_val_less_5.select_dtypes(include=['integer', 'float'])
### Compute the most common value for each column in `numeric_df`.
replacement_values_dict = numeric_df.mode().to_dict(orient='records')[0]
replacement_values_dict

{'Mas Vnr Area': 0.0,
 'BsmtFin SF 1': 0.0,
 'BsmtFin SF 2': 0.0,
 'Bsmt Unf SF': 0.0,
 'Total Bsmt SF': 0.0,
 'Bsmt Full Bath': 0.0,
 'Bsmt Half Bath': 0.0,
 'Garage Cars': 2.0,
 'Garage Area': 0.0}

In [5]:
## Use `pd.DataFrame.fillna()` to replace missing values.
df = df.fillna(replacement_values_dict)

In [6]:
#drop object columns with one or more missing values

text_cols = df.select_dtypes(include=['object'])
text_cols_count=text_cols.isnull().sum()
## Filter Series to columns containing *any* missing values
missing_text_cols = text_cols_count[text_cols_count > 0]

df = df.drop(missing_text_cols.index, axis=1)

In [7]:
## Verify that every column has 0 missing values
df.isnull().sum()

Order             0
PID               0
MS SubClass       0
MS Zoning         0
Lot Area          0
                 ..
Mo Sold           0
Yr Sold           0
Sale Type         0
Sale Condition    0
SalePrice         0
Length: 64, dtype: int64

In [8]:
##create new feature columns

df['years_until_remod']= df['Year Remod/Add'] - df['Year Built']

df['years_since_remod'] = df['Yr Sold'] - df['Year Remod/Add']

df['years_before_sale']= df['Yr Sold'] - df['Year Built']

Let's check whether there are negative values in our new columns.

In [9]:
df['years_until_remod'][df['years_until_remod'] < 0]

850   -1
Name: years_until_remod, dtype: int64

In [10]:
df['years_since_remod'][df['years_since_remod'] < 0]

1702   -1
2180   -2
2181   -1
Name: years_since_remod, dtype: int64

In [11]:
df['years_before_sale'][df['years_before_sale'] < 0]

2180   -1
Name: years_before_sale, dtype: int64

Apparently, there are negative values in our dataset, which I wasn't expecting. Let's dig deeper to find out why.

In [12]:
df[['years_since_remod','Yr Sold','Year Remod/Add']].loc[2180]

years_since_remod      -2
Yr Sold              2007
Year Remod/Add       2009
Name: 2180, dtype: int64

In [13]:
df[['years_until_remod','Year Remod/Add','Year Built']].loc[850]

years_until_remod      -1
Year Remod/Add       2001
Year Built           2002
Name: 850, dtype: int64

In [14]:
df[['years_before_sale','Yr Sold','Year Built']].loc[2180]

years_before_sale      -1
Yr Sold              2007
Year Built           2008
Name: 2180, dtype: int64

In [15]:
df.isnull().sum().value_counts()

0    67
dtype: int64

From our observations so far, the data contains some discrepancies, possibly caused by human error when the data was recorded.

For instance, it is impossible for a house to be built in 2008 and sold in 2007. It is also unlikely for a house to be built in 2002 and remodelled in 2001.

So we are going to drop all the rows with negative values.
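
The affected rows can also be located with a boolean mask instead of reading the index labels off the outputs. The sketch below uses a three-row toy frame (made-up values that mimic the column names) and is only one possible approach.

```python
import pandas as pd

# Toy data (hypothetical values); rows 1 and 2 contain impossible year combinations
toy = pd.DataFrame({
    "Year Built": [2000, 2002, 2008],
    "Year Remod/Add": [2005, 2001, 2009],
    "Yr Sold": [2008, 2006, 2007],
})

derived = pd.DataFrame({
    "years_until_remod": toy["Year Remod/Add"] - toy["Year Built"],
    "years_since_remod": toy["Yr Sold"] - toy["Year Remod/Add"],
    "years_before_sale": toy["Yr Sold"] - toy["Year Built"],
})

# True for every row where at least one derived feature is negative
has_negative = (derived < 0).any(axis=1)
cleaned = toy[~has_negative]

print(cleaned.index.tolist())
```

A mask like this keeps working if the rows are reordered or the data is reloaded with a different index.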

In [16]:
## Drop rows with negative values for any of these new features
df = df.drop([850, 1702,2180,2181], axis=0)

In [17]:
df.isnull().sum().value_counts()

0    67
dtype: int64

In [18]:
#Which columns need to be dropped for other reasons?

## Drop columns that aren't useful for ML
df = df.drop(["PID", "Order"], axis=1)

## Drop columns that leak info about the final sale
df = df.drop(["Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"], axis=1)  

In [19]:
df.isnull().sum().value_counts()

0    61
dtype: int64

In [20]:
df.columns

Index(['MS SubClass', 'MS Zoning', 'Lot Area', 'Street', 'Lot Shape',
       'Land Contour', 'Utilities', 'Lot Config', 'Land Slope', 'Neighborhood',
       'Condition 1', 'Condition 2', 'Bldg Type', 'House Style',
       'Overall Qual', 'Overall Cond', 'Year Built', 'Year Remod/Add',
       'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd',
       'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation',
       'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       'Heating', 'Heating QC', 'Central Air', '1st Flr SF', '2nd Flr SF',
       'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath', 'Bsmt Half Bath',
       'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr',
       'Kitchen Qual', 'TotRms AbvGrd', 'Functional', 'Fireplaces',
       'Garage Cars', 'Garage Area', 'Paved Drive', 'Wood Deck SF',
       'Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch',
       'Pool Area', 'Misc Val', 'SalePrice', 'years_until_remod',
       'years_since

Now let's update our `transform_features()` function with what we have done so far.

In [21]:
#Update transform_features() so that it returns a cleaned copy of the data frame.
def transform_features(df):
    percent_df=df.isnull().sum() * 100/len(df)
    #drop columns with more than 5% missing values
    missing_val_greater_5=percent_df[(percent_df>5)]
    df=df.drop(missing_val_greater_5.index,axis=1)

    text_cols = df.select_dtypes(include=['object'])
    text_cols_count=text_cols.isnull().sum()
    ## Filter Series to text columns containing *any* missing values
    missing_text_cols = text_cols_count[text_cols_count > 0]
    df = df.drop(missing_text_cols.index, axis=1)

    ## Fill numeric columns that have some (but under 5%) missing values with their mode
    num_missing = df.select_dtypes(include=['int', 'float']).isnull().sum()
    fixable_numeric_cols = num_missing[(num_missing < len(df)/20) & (num_missing > 0)].sort_values()
    replacement_values_dict = df[fixable_numeric_cols.index].mode().to_dict(orient='records')[0]
    df = df.fillna(replacement_values_dict)

    df['years_until_remod']= df['Year Remod/Add'] - df['Year Built']
    df['years_since_remod'] = df['Yr Sold'] - df['Year Remod/Add']
    df['years_before_sale']= df['Yr Sold'] - df['Year Built']
    ## Drop rows with negative values for any of these new features
    df = df.drop([850, 1702,2180,2181], axis=0)
    ## Drop columns that aren't useful for ML or that leak info about the final sale
    df = df.drop(["PID", "Order","Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"], axis=1)
    return df

def select_features(df):
    return df[["Gr Liv Area", "SalePrice"]]

def train_and_test(df):  
    train = df[:1460]
    test = df[1460:]
    
    ## You can use `pd.DataFrame.select_dtypes()` to specify column types
    ## and return only those columns as a data frame.
    numeric_train = train.select_dtypes(include=['integer', 'float'])
    numeric_test = test.select_dtypes(include=['integer', 'float'])
    
    ## You can use `pd.Series.drop()` to drop a value.
    features = numeric_train.columns.drop("SalePrice")
    lr = LinearRegression()
    lr.fit(train[features], train["SalePrice"])
    predictions = lr.predict(test[features])
    mse = mean_squared_error(test["SalePrice"], predictions)
    rmse = np.sqrt(mse)
    
    return rmse

df = pd.read_csv("AmesHousing.tsv", delimiter="\t")
transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df)

rmse

55284.62277814025

## Feature Selection

Now that we have cleaned and transformed a lot of the features in the data set, it's time to move on to feature selection for numerical features.

## Instructions

* Generate a correlation heatmap matrix of the numerical features in the training data set.
  * Which features correlate strongly with our target column, `SalePrice`?
  * Calculate the correlation coefficients for the columns that seem to correlate well with `SalePrice`. Because we have a pipeline in place, it's easy to try different features and see which features result in a better cross validation score.

* Which columns in the data frame should be converted to the categorical data type? All of the columns that can be categorized as nominal variables are candidates for being converted to categorical. Here are some other things you should think about:
  * If a categorical column has hundreds of unique values (or categories), should you keep it? When you dummy code this column, hundreds of columns will need to be added back to the data frame.
  * Which categorical columns have a few unique values but more than 95% of the values in the column belong to a specific category? This would be similar to a low variance numerical feature (no variability in the data for the model to capture).

* Which columns are currently numerical but need to be encoded as categorical instead (because the numbers don't have any semantic meaning)?
* What are some ways we can explore which categorical columns "correlate" well with `SalePrice`?
* Update the logic for the `select_features()` function. This function should take in the new, modified train and test data frames that were returned from `transform_features()`.
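
To make the 95% question concrete, the share of the most frequent category can be computed per column. This is a minimal sketch on toy data; the helper name `dominant_share` and the category counts are assumptions for illustration.

```python
import pandas as pd

def dominant_share(series):
    """Fraction of rows that belong to the most frequent category."""
    return series.value_counts(normalize=True).iloc[0]

# Toy data (hypothetical counts)
toy = pd.DataFrame({
    "Street": ["Pave"] * 97 + ["Grvl"] * 3,     # one category covers 97%
    "Lot Shape": ["Reg"] * 60 + ["IR1"] * 40,   # categories are more balanced
})

shares = toy.apply(dominant_share)
low_variance_cols = shares[shares > 0.95].index.tolist()

print(low_variance_cols)
```

Columns flagged this way carry little variability for the model to capture and are candidates for removal.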


In [22]:
#Generate a correlation heatmap matrix of the numerical features in the training data set.
numerical_df=transform_df.select_dtypes(include=['integer', 'float'])
corr_series = numerical_df.corr()  #check the correlation
sorted_corrs = np.abs(corr_series['SalePrice']).sort_values() #sort the dataset according to the target column 'sale price'
print(sorted_corrs)

BsmtFin SF 2         0.006156
Misc Val             0.019264
3Ssn Porch           0.032279
Bsmt Half Bath       0.035852
Low Qual Fin SF      0.037620
Pool Area            0.068445
MS SubClass          0.085056
Overall Cond         0.101498
Screen Porch         0.112310
Kitchen AbvGr        0.119743
Enclosed Porch       0.128656
Bedroom AbvGr        0.143902
Bsmt Unf SF          0.182862
years_until_remod    0.240017
Lot Area             0.267517
2nd Flr SF           0.269707
Bsmt Full Bath       0.276214
Half Bath            0.284974
Open Porch SF        0.316277
Wood Deck SF         0.328158
BsmtFin SF 1         0.439365
Fireplaces           0.474994
TotRms AbvGrd        0.498614
Mas Vnr Area         0.507010
Year Remod/Add       0.532996
years_since_remod    0.534972
Full Bath            0.546108
Year Built           0.558499
years_before_sale    0.558984
1st Flr SF           0.635183
Garage Area          0.641414
Total Bsmt SF        0.644023
Garage Cars          0.648351
Gr Liv Are

From what we can see so far, `Gr Liv Area` and `Overall Qual` have the strongest correlation with `SalePrice`. For now, let's keep only the features that have a correlation of 0.3 or higher. This cutoff is a bit arbitrary and, in general, it's a good idea to experiment with this cutoff. For example, you can train and test models using different cutoffs and see where your model stops improving.
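
One way to run such an experiment is to loop over candidate cutoffs and record the holdout RMSE for each. The sketch below uses synthetic data and a hypothetical helper `holdout_rmse`; the column names, the cutoffs, and the choice to compute correlations on the training half only are assumptions of this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data (hypothetical): one strong, one weak, and one irrelevant feature
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy["target"] = 3 * toy["strong"] + 0.5 * toy["weak"] + rng.normal(scale=0.5, size=n)

def holdout_rmse(frame, cutoff, target="target", n_train=100):
    """Keep features whose absolute correlation with the target is at least `cutoff`."""
    train, test = frame.iloc[:n_train], frame.iloc[n_train:]
    corrs = train.corr()[target].abs().drop(target)
    features = corrs[corrs >= cutoff].index
    if len(features) == 0:
        return None  # nothing left to train on at this cutoff
    model = LinearRegression().fit(train[features], train[target])
    predictions = model.predict(test[features])
    return float(np.sqrt(mean_squared_error(test[target], predictions)))

results = {cutoff: holdout_rmse(toy, cutoff) for cutoff in (0.0, 0.3, 0.5)}
print(results)
```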

In [23]:
#select only features with a correlation above 0.3
sorted_corrs_more_3=sorted_corrs[sorted_corrs>0.3]
sorted_corrs_more_3

Open Porch SF        0.316277
Wood Deck SF         0.328158
BsmtFin SF 1         0.439365
Fireplaces           0.474994
TotRms AbvGrd        0.498614
Mas Vnr Area         0.507010
Year Remod/Add       0.532996
years_since_remod    0.534972
Full Bath            0.546108
Year Built           0.558499
years_before_sale    0.558984
1st Flr SF           0.635183
Garage Area          0.641414
Total Bsmt SF        0.644023
Garage Cars          0.648351
Gr Liv Area          0.717617
Overall Qual         0.801212
SalePrice            1.000000
Name: SalePrice, dtype: float64

In [24]:
## Drop columns with less than 0.3 correlation with SalePrice
transform_df = transform_df.drop(sorted_corrs[sorted_corrs<0.3].index, axis=1) 

Which columns in the data frame should be converted to the categorical data type? 

All of the columns that can be categorized as nominal variables are candidates for being converted to categorical. Here are some other things you should think about:

* If a categorical column has hundreds of unique values (or categories), should you keep it? When you dummy code this column, hundreds of columns will need to be added back to the data frame.
* Which categorical columns have a few unique values but more than 95% of the values in the column belong to a specific category? This would be similar to a low variance numerical feature (no variability in the data for the model to capture).

To answer this, let's take a look at the number of unique values in each of the `text_cols`.

In [26]:
#select text feature columns
text_cols=transform_df.select_dtypes(include=['object'])
text_unique_count=text_cols.nunique().sort_values()
drop_text_cols=text_unique_count[text_unique_count>10]
#drop text features with more than 10 unique values
transform_df=transform_df.drop(drop_text_cols.index,axis=1) 

In [27]:
## Select just the remaining text columns and convert to categorical
text_cols = transform_df.select_dtypes(include=['object'])
for col in text_cols:
    transform_df[col] = transform_df[col].astype('category')
    
## Create dummy columns and add back to the dataframe!
transform_df = pd.concat([
    transform_df, 
    pd.get_dummies(transform_df.select_dtypes(include=['category']))
], axis=1).drop(text_cols,axis=1)
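
One detail worth checking here: depending on the installed pandas version, `pd.get_dummies()` may return boolean columns by default, and those would not be picked up by a later `select_dtypes(include=['integer', 'float'])` call. A hedged sketch of one way to keep the dummy columns numeric is to pass the `dtype` argument explicitly. The toy values below are made up.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Central Air": ["Y", "N", "Y"],
    "SalePrice": [200000, 120000, 180000],
})

# dtype=float forces numeric dummy columns regardless of the pandas default
dummies = pd.get_dummies(toy[["Central Air"]], dtype=float)
encoded = pd.concat([toy.drop(columns=["Central Air"]), dummies], axis=1)

numeric_cols = encoded.select_dtypes(include=["integer", "float"]).columns.tolist()
print(numeric_cols)
```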

At this point, update your `select_features()` function.

In [39]:
#Update transform_features() so that it returns a cleaned copy of the data frame.
def transform_features(df):
    percent_df=df.isnull().sum() * 100/len(df)
    #drop columns with more than 5% missing values
    missing_val_greater_5=percent_df[(percent_df>5)]
    df=df.drop(missing_val_greater_5.index,axis=1)

    text_cols = df.select_dtypes(include=['object'])
    text_cols_count=text_cols.isnull().sum()
    ## Filter Series to text columns containing *any* missing values
    missing_text_cols = text_cols_count[text_cols_count > 0]
    df = df.drop(missing_text_cols.index, axis=1)

    ## Fill numeric columns that have some (but under 5%) missing values with their mode
    num_missing = df.select_dtypes(include=['int', 'float']).isnull().sum()
    fixable_numeric_cols = num_missing[(num_missing < len(df)/20) & (num_missing > 0)].sort_values()
    replacement_values_dict = df[fixable_numeric_cols.index].mode().to_dict(orient='records')[0]
    df = df.fillna(replacement_values_dict)

    df['years_until_remod']= df['Year Remod/Add'] - df['Year Built']
    df['years_since_remod'] = df['Yr Sold'] - df['Year Remod/Add']
    df['years_before_sale']= df['Yr Sold'] - df['Year Built']
    ## Drop rows with negative values for any of these new features
    df = df.drop([850, 1702,2180,2181], axis=0)
    ## Drop columns that aren't useful for ML or that leak info about the final sale
    df = df.drop(["PID", "Order","Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"], axis=1)
    return df

#I experimented with the coefficient threshold before settling on 0.4.
#I tried 0.3 but the RMSE was high, and I also tried 0.5 and 0.6; it looks like
#0.4 gives better results.
def select_features(df, coeff_threshold=0.4, uniq_threshold=10):
    numerical_df = df.select_dtypes(include=['int', 'float'])
    abs_corr_coeffs = numerical_df.corr()['SalePrice'].abs().sort_values()
    df = df.drop(abs_corr_coeffs[abs_corr_coeffs < coeff_threshold].index, axis=1)

    #select text feature columns
    text_cols=df.select_dtypes(include=['object'])
    text_unique_count=text_cols.nunique().sort_values()
    #drop text features with more than `uniq_threshold` unique values
    drop_text_cols=text_unique_count[text_unique_count>uniq_threshold]
    df=df.drop(drop_text_cols.index,axis=1)

    #convert the remaining text columns to categorical and dummy code them
    text_cols = df.select_dtypes(include=['object'])
    for col in text_cols:
        df[col] = df[col].astype('category')
    df = pd.concat([df, pd.get_dummies(df.select_dtypes(include=['category']))], axis=1).drop(text_cols.columns,axis=1)

    return df


def train_and_test(df):  
    train = df[:1460]
    test = df[1460:]
    
    ## You can use `pd.DataFrame.select_dtypes()` to specify column types
    ## and return only those columns as a data frame.
    numeric_train = train.select_dtypes(include=['integer', 'float'])
    numeric_test = test.select_dtypes(include=['integer', 'float'])
    
    ## You can use `pd.Series.drop()` to drop a value.
    features = numeric_train.columns.drop("SalePrice")
    lr = LinearRegression()
    lr.fit(train[features], train["SalePrice"])
    predictions = lr.predict(test[features])
    mse = mean_squared_error(test["SalePrice"], predictions)
    rmse = np.sqrt(mse)
    
    return rmse

df = pd.read_csv("AmesHousing.tsv", delimiter="\t")
transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df)

rmse

33468.08890531518

## Train And Test
Now for the final part of the pipeline, training and testing. When iterating on different features, using simple validation is a good idea. Let's add a parameter named k that controls the type of cross validation that occurs.

### Instructions

* The optional `k` parameter should accept integer values, with a default value of `0`.
* When `k` equals `0`, perform holdout validation (what we already implemented):
  * Select the first `1460` rows and assign to `train`.
  * Select the remaining rows and assign to `test`.
  * Train on `train` and test on `test`.
  * Compute the RMSE and return.
* When `k` equals `1`, perform simple cross validation:
  * Shuffle the ordering of the rows in the data frame.
  * Select the first `1460` rows and assign to `fold_one`.
  * Select the remaining rows and assign to `fold_two`.
  * Train on `fold_one` and test on `fold_two`.
  * Train on `fold_two` and test on `fold_one`.
  * Compute the average RMSE and return.
* When `k` is greater than `1`, implement k-fold cross validation using `k` folds:
  * Split the data into `k` folds, training on `k - 1` folds and testing on the remaining fold each time.
  * Calculate the average RMSE value and return this value.
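
As a point of comparison for the k-fold branch, scikit-learn's `cross_val_score()` can compute the per-fold errors directly. This is a minimal sketch on synthetic data, not the implementation the instructions ask for; the data, the seed, and the number of folds are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (hypothetical): two features with a small amount of noise
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 2))
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.3, size=120)

kf = KFold(n_splits=4, shuffle=True, random_state=1)
# scikit-learn reports negated MSE so that higher scores are always better
neg_mses = cross_val_score(LinearRegression(), X, y, scoring="neg_mean_squared_error", cv=kf)
fold_rmses = np.sqrt(-neg_mses)
avg_rmse = fold_rmses.mean()

print(fold_rmses, avg_rmse)
```

Comparing the per-fold values, and not just their average, helps reveal folds where the model behaves very differently.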

In [47]:
#Update transform_features() so that it returns a cleaned copy of the data frame.
def transform_features(df):
    percent_df=df.isnull().sum() * 100/len(df)
    #drop columns with more than 5% missing values
    missing_val_greater_5=percent_df[(percent_df>5)]
    df=df.drop(missing_val_greater_5.index,axis=1)

    text_cols = df.select_dtypes(include=['object'])
    text_cols_count=text_cols.isnull().sum()
    ## Filter Series to text columns containing *any* missing values
    missing_text_cols = text_cols_count[text_cols_count > 0]
    df = df.drop(missing_text_cols.index, axis=1)

    ## Fill numeric columns that have some (but under 5%) missing values with their mode
    num_missing = df.select_dtypes(include=['int', 'float']).isnull().sum()
    fixable_numeric_cols = num_missing[(num_missing < len(df)/20) & (num_missing > 0)].sort_values()
    replacement_values_dict = df[fixable_numeric_cols.index].mode().to_dict(orient='records')[0]
    df = df.fillna(replacement_values_dict)

    df['years_until_remod']= df['Year Remod/Add'] - df['Year Built']
    df['years_since_remod'] = df['Yr Sold'] - df['Year Remod/Add']
    df['years_before_sale']= df['Yr Sold'] - df['Year Built']
    ## Drop rows with negative values for any of these new features
    df = df.drop([850, 1702,2180,2181], axis=0)
    ## Drop columns that aren't useful for ML or that leak info about the final sale
    df = df.drop(["PID", "Order","Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"], axis=1)
    return df

#I experimented with the coefficient threshold before settling on 0.4.
#I tried 0.3 but the RMSE was high, and I also tried 0.5 and 0.6; it looks like
#0.4 gives better results.
def select_features(df, coeff_threshold=0.4, uniq_threshold=10):
    numerical_df = df.select_dtypes(include=['int', 'float'])
    abs_corr_coeffs = numerical_df.corr()['SalePrice'].abs().sort_values()
    df = df.drop(abs_corr_coeffs[abs_corr_coeffs < coeff_threshold].index, axis=1)

    #select text feature columns
    text_cols=df.select_dtypes(include=['object'])
    text_unique_count=text_cols.nunique().sort_values()
    #drop text features with more than `uniq_threshold` unique values
    drop_text_cols=text_unique_count[text_unique_count>uniq_threshold]
    df=df.drop(drop_text_cols.index,axis=1)

    #convert the remaining text columns to categorical and dummy code them
    text_cols = df.select_dtypes(include=['object'])
    for col in text_cols:
        df[col] = df[col].astype('category')
    df = pd.concat([df, pd.get_dummies(df.select_dtypes(include=['category']))], axis=1).drop(text_cols.columns,axis=1)

    return df

def train_and_test(df,k=0):
    from sklearn.model_selection import KFold
    ## You can use `pd.DataFrame.select_dtypes()` to specify column types
    ## and return only those columns as a data frame.
    numeric_df = df.select_dtypes(include=['integer', 'float'])
    lr  = LinearRegression()
    ## You can use `pd.Index.drop()` to drop a label.
    features = numeric_df.columns.drop("SalePrice")
    if k==0:
        ## Holdout validation: first 1460 rows for training, the rest for testing
        train = df[:1460]
        test = df[1460:]
        lr.fit(train[features], train["SalePrice"])
        predictions = lr.predict(test[features])
        mse = mean_squared_error(test["SalePrice"], predictions)
        rmse = np.sqrt(mse)
        return rmse

    elif k==1:
        ## Shuffle the rows by position. `.iloc` is needed here because the row
        ## labels are no longer contiguous after rows were dropped.
        df = df.iloc[np.random.permutation(len(df))]
        fold_one = df[:1460]
        fold_two = df[1460:]
        ## Train on fold_one, test on fold_two
        lr.fit(fold_one[features], fold_one["SalePrice"])
        predictions1 = lr.predict(fold_two[features])
        ## Train on fold_two, test on fold_one
        lr.fit(fold_two[features], fold_two["SalePrice"])
        predictions2 = lr.predict(fold_one[features])
        mse1 = mean_squared_error(fold_two["SalePrice"], predictions1)
        mse2 = mean_squared_error(fold_one["SalePrice"], predictions2)
        rmse1 = np.sqrt(mse1)
        rmse2 = np.sqrt(mse2)
        avg_rmse = np.mean([rmse1, rmse2])

        return avg_rmse
    else:
        ## k-fold cross validation with k folds
        kf = KFold(n_splits=k, shuffle=True)
        rmse_values = []
        for train_index, test_index in kf.split(df):
            train = df.iloc[train_index]
            test = df.iloc[test_index]
            lr.fit(train[features], train["SalePrice"])
            predictions = lr.predict(test[features])
            mse = mean_squared_error(test["SalePrice"], predictions)
            rmse = np.sqrt(mse)
            rmse_values.append(rmse)
        print(rmse_values)
        avg_rmse = np.mean(rmse_values)
        return avg_rmse

df = pd.read_csv("AmesHousing.tsv", delimiter="\t")
transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df, k=4)

rmse


[36550.23773803087, 26985909.036646564, 9427487.033298733, 272324908.0519385]


77193713.58990546