In [5]:
import pandas as pd

In [6]:
# Column names taken from the Kaggle source
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

In [7]:
# Load the CSV data into a pandas DataFrame so it is easier to work with,
# assigning the column names defined above

df = pd.read_csv('data/housing.csv', header=None, delimiter=r"\s+", names=column_names)

In [8]:
df.tail(5)

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273.0,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.12,76.7,2.2875,1,273.0,21.0,396.9,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273.0,21.0,396.9,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273.0,21.0,393.45,6.48,22.0
505,0.04741,0.0,11.93,0,0.573,6.03,80.8,2.505,1,273.0,21.0,396.9,7.88,11.9
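
The `delimiter=r"\s+"` argument tells pandas to split each line on runs of whitespace, and `header=None` with `names=...` supplies the column names because the raw file has no header row. A minimal sketch of the same call, using two rows copied from the output above and held in an in-memory string as a stand-in for `data/housing.csv` (the exact spacing of the real file is an assumption):

```python
import io

import pandas as pd

column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
                'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# Two records with irregular spacing between fields (spacing is illustrative)
raw_text = (
    "0.10959   0.00  11.930  0  0.5730  6.7940  89.30  2.3889   1  273.0  21.00 393.45   6.48  22.00\n"
    "0.04741   0.00  11.930  0  0.5730  6.0300  80.80  2.5050   1  273.0  21.00 396.90   7.88  11.90\n"
)

# Same arguments as the notebook call, but reading from a string buffer
sample_df = pd.read_csv(io.StringIO(raw_text), header=None,
                        delimiter=r"\s+", names=column_names)

print(sample_df.shape)
print(sample_df[['CRIM', 'CHAS', 'MEDV']])
```

Each run of spaces is treated as a single separator, so the uneven padding does not produce empty columns.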


## Column Meaning from original file

### Variables

There are 14 attributes in each case of the dataset. They are:

    1. CRIM - per capita crime rate by town
    2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
    3. INDUS - proportion of non-retail business acres per town.
    4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
    5. NOX - nitric oxides concentration (parts per 10 million)
    6. RM - average number of rooms per dwelling
    7. AGE - proportion of owner-occupied units built prior to 1940
    8. DIS - weighted distances to five Boston employment centres 
    9. RAD - index of accessibility to radial highways
    10. TAX - full-value property-tax rate per $10,000
    11. PTRATIO - pupil-teacher ratio by town
    12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    13. LSTAT - % lower status of the population
    14. MEDV - Median value of owner-occupied homes in $1000's
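
Most of these attributes are direct measurements, but `B` is a transformed value, so it helps to work through the formula once. The sketch below evaluates `B = 1000(Bk - 0.63)^2` and inverts it; the function names are hypothetical and chosen only for illustration:

```python
import math


def b_from_proportion(bk):
    """Compute B = 1000 * (Bk - 0.63)^2 for a proportion Bk in [0, 1]."""
    return 1000.0 * (bk - 0.63) ** 2


def proportions_from_b(b):
    """Return every Bk in [0, 1] that maps to the given B value."""
    root = math.sqrt(b / 1000.0)
    candidates = {0.63 - root, 0.63 + root}
    # Small tolerance so floating-point noise at the boundaries is accepted
    return sorted(bk for bk in candidates if -1e-9 <= bk <= 1.0 + 1e-9)


# A proportion of 0 gives the value 396.9 that appears often in the data
print(round(b_from_proportion(0.0), 2))

# Some B values correspond to two different proportions
print(proportions_from_b(100.0))
```

Because the formula squares the distance from 0.63, it cannot always be inverted uniquely: any `B` below `1000 * 0.37^2 = 136.9` is produced by two different proportions, one on each side of 0.63.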

Note: I worked out the meaning of these technical terms with the help of ChatGPT (GPT-4 model), which saved time.


### Converting Column Names to Meaningful Names


## Boston Housing Dataset New Column Names & Descriptions

- **CRIM** -> **Crime_Rate_Per_Capita**
- **ZN** -> **Large_Lot_Zone_Proportion**
- **INDUS** -> **Non_Retail_Business_Proportion**
- **CHAS** -> **Near_Charles_River**
- **NOX** -> **Nitric_Oxide_Concentration**
- **RM** -> **Average_Rooms_Per_Home**
- **AGE** -> **Proportion_Older_Homes**
- **DIS** -> **Distance_to_Employment_Centers**
- **RAD** -> **Access_to_Radial_Highways**
- **TAX** -> **Property_Tax_Rate**
- **PTRATIO** -> **Student_Teacher_Ratio**
- **B** -> **Proportion_of_Black_Residents**
- **LSTAT** -> **Lower_Status_Population_Percentage**
- **MEDV** -> **Median_Home_Value_Thousands**
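
The mapping above can also be applied to an already loaded DataFrame with `DataFrame.rename`, as an alternative to reading the file again. A minimal sketch, assuming a DataFrame that still carries the original short names; `rename_map` and `toy` are illustrative names, and the single row is copied from the first record of the dataset:

```python
import pandas as pd

# Old name -> new name, in the same order as the list above
rename_map = {
    'CRIM': 'Crime_Rate_Per_Capita',
    'ZN': 'Large_Lot_Zone_Proportion',
    'INDUS': 'Non_Retail_Business_Proportion',
    'CHAS': 'Near_Charles_River',
    'NOX': 'Nitric_Oxide_Concentration',
    'RM': 'Average_Rooms_Per_Home',
    'AGE': 'Proportion_Older_Homes',
    'DIS': 'Distance_to_Employment_Centers',
    'RAD': 'Access_to_Radial_Highways',
    'TAX': 'Property_Tax_Rate',
    'PTRATIO': 'Student_Teacher_Ratio',
    'B': 'Proportion_of_Black_Residents',
    'LSTAT': 'Lower_Status_Population_Percentage',
    'MEDV': 'Median_Home_Value_Thousands',
}

# One-row frame with the short names, standing in for the full dataset
toy = pd.DataFrame(
    [[0.00632, 18.0, 2.31, 0, 0.538, 6.575, 65.2, 4.09, 1, 296.0, 15.3, 396.9, 4.98, 24.0]],
    columns=list(rename_map),
)

# rename returns a new frame; the original keeps its short names
renamed = toy.rename(columns=rename_map)
print(list(renamed.columns))
```

A dictionary keeps each old name next to its replacement, so a reordered or missing column is less likely to be mislabeled than with a positional list.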


## Column Descriptions

- **Crime_Rate_Per_Capita**: This is the crime rate per person in the area. A higher number means more crime.
- **Large_Lot_Zone_Proportion**: This shows what proportion of the residential land in the area is zoned for large lots (lots of more than 25,000 square feet).
- **Non_Retail_Business_Proportion**: This indicates the percentage of the town's land used for businesses that aren't retail (like factories or offices).
- **Near_Charles_River**: This is a yes/no (1 or 0) indicator showing whether the area is next to the Charles River.
- **Nitric_Oxide_Concentration**: This measures the level of air pollution from nitric oxides, which are harmful gases produced by cars and factories.
- **Average_Rooms_Per_Home**: This is the average number of rooms in houses in the area.
- **Proportion_Older_Homes**: This shows the percentage of houses that were built before 1940.
- **Distance_to_Employment_Centers**: This is a weighted distance from the area to five major employment centers in Boston. A higher number means the area is farther from those centers.
- **Access_to_Radial_Highways**: This is a rating of how easy it is to get to highways from the area.
- **Property_Tax_Rate**: This shows the full-value property tax rate per $10,000 of property value. Higher values mean higher taxes.
- **Student_Teacher_Ratio**: This is the number of students for each teacher in schools in the area. A lower number usually means smaller class sizes.
- **Proportion_of_Black_Residents**: Despite the name, this is not a raw proportion. It holds the transformed value 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents in the town.
- **Lower_Status_Population_Percentage**: This indicates the percentage of residents considered to be of lower socio-economic status.
- **Median_Home_Value_Thousands**: This is the median value of houses in the area, reported in thousands of dollars. A higher number indicates more expensive homes.
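
The descriptions above imply a few simple range checks, and the median value needs a unit conversion before it reads as dollars. A rough sketch under those assumptions, using three records from the dataset; `check_value_ranges` and `Median_Home_Value_Dollars` are hypothetical names introduced here:

```python
import pandas as pd


def check_value_ranges(frame):
    """Return simple range checks implied by the column descriptions."""
    return {
        "river_flag_is_binary": bool(frame["Near_Charles_River"].isin([0, 1]).all()),
        "older_homes_is_percentage": bool(frame["Proportion_Older_Homes"].between(0, 100).all()),
        "large_lot_is_percentage": bool(frame["Large_Lot_Zone_Proportion"].between(0, 100).all()),
        "home_value_is_positive": bool((frame["Median_Home_Value_Thousands"] > 0).all()),
    }


# Three records from the dataset, restricted to the columns being checked
sample = pd.DataFrame({
    "Near_Charles_River": [0, 0, 0],
    "Proportion_Older_Homes": [65.2, 78.9, 61.1],
    "Large_Lot_Zone_Proportion": [18.0, 0.0, 0.0],
    "Median_Home_Value_Thousands": [24.0, 21.6, 34.7],
})

checks = check_value_ranges(sample)
print(checks)

# The median value is stored in thousands of dollars; convert to dollars
sample["Median_Home_Value_Dollars"] = sample["Median_Home_Value_Thousands"] * 1000
print(sample[["Median_Home_Value_Thousands", "Median_Home_Value_Dollars"]])
```

A failed check would suggest a parsing problem or a misassigned column name rather than a real property of the towns.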


In [9]:
# New Column Names
new_column_names = [
    'Crime_Rate_Per_Capita', 
    'Large_Lot_Zone_Proportion', 
    'Non_Retail_Business_Proportion', 
    'Near_Charles_River', 
    'Nitric_Oxide_Concentration', 
    'Average_Rooms_Per_Home', 
    'Proportion_Older_Homes', 
    'Distance_to_Employment_Centers', 
    'Access_to_Radial_Highways', 
    'Property_Tax_Rate', 
    'Student_Teacher_Ratio', 
    'Proportion_of_Black_Residents', 
    'Lower_Status_Population_Percentage', 
    'Median_Home_Value_Thousands'
]

In [12]:
# Re-read the file, this time assigning the descriptive column names
df = pd.read_csv('data/housing.csv', header=None, delimiter=r"\s+", names=new_column_names)

In [13]:
df.head(5)

,Crime_Rate_Per_Capita,Large_Lot_Zone_Proportion,Non_Retail_Business_Proportion,Near_Charles_River,Nitric_Oxide_Concentration,Average_Rooms_Per_Home,Proportion_Older_Homes,Distance_to_Employment_Centers,Access_to_Radial_Highways,Property_Tax_Rate,Student_Teacher_Ratio,Proportion_of_Black_Residents,Lower_Status_Population_Percentage,Median_Home_Value_Thousands
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Crime_Rate_Per_Capita               506 non-null    float64
 1   Large_Lot_Zone_Proportion           506 non-null    float64
 2   Non_Retail_Business_Proportion      506 non-null    float64
 3   Near_Charles_River                  506 non-null    int64  
 4   Nitric_Oxide_Concentration          506 non-null    float64
 5   Average_Rooms_Per_Home              506 non-null    float64
 6   Proportion_Older_Homes              506 non-null    float64
 7   Distance_to_Employment_Centers      506 non-null    float64
 8   Access_to_Radial_Highways           506 non-null    int64  
 9   Property_Tax_Rate                   506 non-null    float64
 10  Student_Teacher_Ratio               506 non-null    float64
 11  Proportion_of_Black_Residents       506 non-n
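
`df.info()` prints its report but does not return it, so the same facts are easier to reuse when collected into a DataFrame. A minimal sketch, assuming a hypothetical helper named `summarize_columns`; the toy frame uses made-up values, including one deliberately missing entry, purely to show the counts changing:

```python
import pandas as pd


def summarize_columns(frame):
    """Build a per-column summary of dtype, non-null count, and missing count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "missing": frame.isna().sum(),
    })


# Illustrative values only; the None is inserted to demonstrate a missing entry
toy = pd.DataFrame({
    "Near_Charles_River": [0, 0, 1],
    "Average_Rooms_Per_Home": [6.575, None, 7.185],
})

summary = summarize_columns(toy)
print(summary)

# Total number of missing cells across the whole frame
print(int(summary["missing"].sum()))
```

On the full dataset, a `missing` column of all zeros would match the `506 non-null` counts reported above.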

### 🏠 📊 Key Features in Each Column

- **Crime_Rate_Per_Capita**: 
  - **The higher the number, the more crime per person; the lower the number, the less crime.**
- **Large_Lot_Zone_Proportion**: 
  - **Indicates the share of residential land zoned for large lots; higher values mean more land reserved for large residential lots.**
- **Non_Retail_Business_Proportion**: 
  - **Higher values suggest more industrial or commercial land use.**
- **Near_Charles_River**: 
  - **1 if near the river, 0 if not; proximity may influence property values.**
- **Nitric_Oxide_Concentration**: 
  - **Greater concentrations indicate higher pollution levels.**
- **Average_Rooms_Per_Home**: 
  - **More rooms generally mean larger dwellings.**
- **Proportion_Older_Homes**: 
  - **A higher percentage indicates an older neighborhood.**
- **Distance_to_Employment_Centers**: 
  - **Shorter distances suggest better job accessibility.**
- **Access_to_Radial_Highways**: 
  - **Higher index means better highway access.**
- **Property_Tax_Rate**: 
  - **Higher rates imply higher property taxes.**
- **Student_Teacher_Ratio**: 
  - **Lower ratios are preferable, indicating fewer students per teacher.**
- **Proportion_of_Black_Residents**: 
  - **A transformed value, 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents; it is not the raw proportion.**
- **Lower_Status_Population_Percentage**: 
  - **A higher percentage points to a higher proportion of the lower socioeconomic status population.**
- **Median_Home_Value_Thousands**: 
  - **Higher values indicate more expensive homes.**
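
Several of the statements above suggest a direction of influence on home value, and a correlation table is one quick way to check them against the data. A rough sketch, assuming every column is numeric (as the `df.info()` output indicates); `correlations_with_target` is a hypothetical helper, and the five records used here are far too few to draw conclusions from:

```python
import pandas as pd


def correlations_with_target(frame, target="Median_Home_Value_Thousands"):
    """Pearson correlation of every other column with the target, sorted ascending."""
    corr = frame.corr()[target].drop(target)
    return corr.sort_values()


# First five records of the dataset, restricted to three columns
sample = pd.DataFrame({
    "Average_Rooms_Per_Home": [6.575, 6.421, 7.185, 6.998, 7.147],
    "Lower_Status_Population_Percentage": [4.98, 9.14, 4.03, 2.94, 5.33],
    "Median_Home_Value_Thousands": [24.0, 21.6, 34.7, 33.4, 36.2],
})

target_corr = correlations_with_target(sample)
print(target_corr)
```

Passing the full DataFrame instead of `sample` would rank all 13 features by their linear association with the median home value. Correlation only captures linear relationships and says nothing about cause.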