Loading the data and displaying the first few rows

In [1]:
import pandas as pd

df = pd.read_csv('dataset/House_Rent_Dataset.csv')
df.head()

,Posted On,BHK,Rent,Size,Floor,Area Type,Area Locality,City,Furnishing Status,Tenant Preferred,Bathroom,Point of Contact
0,2022-05-18,2,10000,1100,Ground out of 2,Super Area,Bandel,Kolkata,Unfurnished,Bachelors/Family,2,Contact Owner
1,2022-05-13,2,20000,800,1 out of 3,Super Area,"Phool Bagan, Kankurgachi",Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
2,2022-05-16,2,17000,1000,1 out of 3,Super Area,Salt Lake City Sector 2,Kolkata,Semi-Furnished,Bachelors/Family,1,Contact Owner
3,2022-07-04,2,10000,800,1 out of 2,Super Area,Dumdum Park,Kolkata,Unfurnished,Bachelors/Family,1,Contact Owner
4,2022-05-09,2,7500,850,1 out of 2,Carpet Area,South Dum Dum,Kolkata,Unfurnished,Bachelors,1,Contact Owner


Data summary and descriptive statistics

In [2]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4746 entries, 0 to 4745
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Posted On          4746 non-null   object
 1   BHK                4746 non-null   int64 
 2   Rent               4746 non-null   int64 
 3   Size               4746 non-null   int64 
 4   Floor              4746 non-null   object
 5   Area Type          4746 non-null   object
 6   Area Locality      4746 non-null   object
 7   City               4746 non-null   object
 8   Furnishing Status  4746 non-null   object
 9   Tenant Preferred   4746 non-null   object
 10  Bathroom           4746 non-null   int64 
 11  Point of Contact   4746 non-null   object
dtypes: int64(4), object(8)
memory usage: 445.1+ KB


,BHK,Rent,Size,Bathroom
count,4746.0,4746.0,4746.0,4746.0
mean,2.08386,34993.45,967.490729,1.965866
std,0.832256,78106.41,634.202328,0.884532
min,1.0,1200.0,10.0,1.0
25%,2.0,10000.0,550.0,1.0
50%,2.0,16000.0,850.0,2.0
75%,3.0,33000.0,1200.0,2.0
max,6.0,3500000.0,8000.0,10.0


Checking for missing values
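
No code accompanies this step in the notebook. A minimal sketch of one common way to count missing values per column is shown here on a small made-up frame; the name `sample` and its values are illustrative assumptions, not part of the dataset. On the real data, the `df.info()` output above already reports 4746 non-null entries for every column, so the same check on `df` would be expected to return zeros.

```python
import pandas as pd

# Toy frame standing in for the rental data (hypothetical values)
sample = pd.DataFrame({
    "Rent": [10000, None, 17000],
    "City": ["Kolkata", "Kolkata", None],
    "BHK": [2, 2, 2],
})

# Count missing values in each column
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

Applied to the notebook's frame, the equivalent call would be `df.isnull().sum()`.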

Example of one-hot encoding for the 'Furnishing Status' column

In [3]:
df = pd.get_dummies(df, columns=['Furnishing Status'], drop_first=True)
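
To illustrate what `drop_first=True` does, the sketch below encodes a three-row toy frame. The frame `toy` is hypothetical, and the category 'Furnished' is an assumption (the rows shown earlier only contain 'Unfurnished' and 'Semi-Furnished'). The first category in sorted order is dropped and acts as the implicit baseline.

```python
import pandas as pd

# Toy column with three categories ('Furnished' is assumed for illustration)
toy = pd.DataFrame(
    {"Furnishing Status": ["Unfurnished", "Semi-Furnished", "Furnished"]}
)

# drop_first=True removes the first category in sorted order ('Furnished'),
# leaving one indicator column per remaining category
encoded = pd.get_dummies(toy, columns=["Furnishing Status"], drop_first=True)
print(list(encoded.columns))
```

A row whose remaining indicator columns are all false therefore belongs to the dropped category.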


The 'Floor' column has a format such as '2 out of 5'. We can split it into two columns: 'Floor Level' and 'Total Floors'.

In [4]:
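def split_floor(x):
    # Split e.g. '1 out of 3' into the floor level and the building's floor count
    parts = x.split(' out of ')
    floor_level = parts[0]
    # Entries without ' out of ' carry no total floor count
    total_floors = parts[1] if len(parts) > 1 else None
    return pd.Series([floor_level, total_floors])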

df[['Floor Level', 'Total Floors']] = df['Floor'].apply(split_floor)

df.drop('Floor', axis=1, inplace=True)


Replace text values with numeric ones, e.g. 'Ground' with 0.

In [5]:
# The upper basement sits directly below ground level, the lower basement one level further down
df['Floor Level'] = df['Floor Level'].replace({'Ground': 0, 'Upper Basement': -1, 'Lower Basement': -2})
df['Floor Level'] = pd.to_numeric(df['Floor Level'], errors='coerce')

df['Total Floors'] = pd.to_numeric(df['Total Floors'], errors='coerce')
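
The two floor-handling steps above can also be sketched in vectorized form on a small sample. This is an alternative illustration, not the notebook's method: it uses `Series.str.split(..., expand=True)` instead of `apply`, and the names `floors`, `parts`, and `level_map` are hypothetical. Only the 'Ground' label is mapped here to keep the example small.

```python
import pandas as pd

# Toy values in the same format as the 'Floor' column; '3' has no total
floors = pd.Series(["Ground out of 2", "1 out of 3", "3"])

# Split once into two columns; rows without ' out of ' get a missing total
parts = floors.str.split(" out of ", expand=True)
parts.columns = ["Floor Level", "Total Floors"]

# Map the text label to a number, leaving other values untouched
level_map = {"Ground": 0}
parts["Floor Level"] = parts["Floor Level"].map(lambda v: level_map.get(v, v))

# Convert both columns to numeric; anything unparseable becomes NaN
parts["Floor Level"] = pd.to_numeric(parts["Floor Level"], errors="coerce")
parts["Total Floors"] = pd.to_numeric(parts["Total Floors"], errors="coerce")
print(parts)
```

Because `errors='coerce'` silently turns unknown labels into NaN, it may be worth counting the NaN values in both columns after conversion.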

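Removing the 'sqft' unit from 'Size' and converting the column to a numeric type. The output below shows that 'Size' is already int64 with no missing values, so in this dataset the step effectively only changes the dtype to float.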

In [6]:
import numpy as np

# Inspect the column before cleaning
print(df['Size'].dtype)
print(df['Size'].isnull().sum())

# Cast to string so the .str accessor works, then restore real NaN values
df['Size'] = df['Size'].astype(str).replace('nan', np.nan)

# Remove the literal ' sqft' suffix and convert to float
df['Size'] = df['Size'].str.replace(' sqft', '', regex=False)
df['Size'] = df['Size'].astype(float)


int64
0
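
Since 'Size' in this dataset carries no unit suffix, the cleaning has no visible effect here. The sketch below shows the same idea on made-up values that do contain ' sqft'; `raw_size` and `clean_size` are hypothetical names, and `pd.to_numeric` is used in place of `astype(float)` so that missing entries pass through as NaN.

```python
import pandas as pd

# Toy values with a unit suffix and one missing entry (illustrative only)
raw_size = pd.Series(["1100 sqft", "800 sqft", None])

# Strip the literal suffix, then convert; missing values stay NaN
clean_size = pd.to_numeric(
    raw_size.str.replace(" sqft", "", regex=False), errors="coerce"
)
print(clean_size)
```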


Converting the 'Posted On' column to datetime

In [7]:
df['Posted On'] = pd.to_datetime(df['Posted On'], format='%Y-%m-%d')

Columns such as 'Area Locality' and 'Point of Contact' can be dropped

In [8]:
df.drop(['Area Locality', 'Point of Contact'], axis=1, inplace=True)

Encoding the remaining categorical columns

In [9]:
categorical_cols = ['Area Type', 'City', 'Tenant Preferred']
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

Splitting the data into features (X) and the label (y):

In [10]:
# Target variable
y = df['Rent']

# Features
X = df.drop('Rent', axis=1)

Splitting the data into training and test sets:

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
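
As a quick sanity check of what `test_size=0.3` and `random_state=42` imply, the sketch below splits a ten-row toy frame. The names `X_toy` and `y_toy` are hypothetical. It shows that 30% of the rows go to the test set, that the two sets do not overlap, and that repeating the call with the same `random_state` reproduces the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten-row toy data standing in for X and y (illustrative values)
X_toy = pd.DataFrame({"Size": range(10)})
y_toy = pd.Series(range(100, 110), name="Rent")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

# Same arguments and seed give the same rows again
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

print(len(X_tr), len(X_te))
```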

Saving the full, cleaned dataset

In [15]:
# Save the cleaned dataset to a CSV file
df.to_csv('data/cleaned_data.csv', index=False)

Saving the training and test sets

In [13]:
# Save the training set
X_train.to_csv('data/X_train.csv', index=False)
y_train.to_csv('data/y_train.csv', index=False)

# Save the test set
X_test.to_csv('data/X_test.csv', index=False)
y_test.to_csv('data/y_test.csv', index=False)
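
The save calls above assume that the `data/` directory already exists, since `to_csv` does not create missing directories. A minimal sketch of creating the directory first and reading a saved target back is given below. It writes to a temporary directory so that it can run anywhere, and `y_toy`, `out_dir`, and `restored` are hypothetical names.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy target series standing in for y_train (illustrative values)
y_toy = pd.Series([10000, 20000, 17000], name="Rent")

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data"
    # Create the output directory before writing
    out_dir.mkdir(parents=True, exist_ok=True)

    y_toy.to_csv(out_dir / "y_train.csv", index=False)

    # The file has a 'Rent' header, so read_csv returns a one-column DataFrame
    restored = pd.read_csv(out_dir / "y_train.csv")["Rent"]

print(restored.tolist())
```

Selecting the 'Rent' column after `read_csv` recovers a Series comparable to the one that was saved.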