# <b>DSAI 3201 Machine Learning Project</b>

## Data Understanding and Preprocessing

## <b>1. Dataset Exploration</b>

Examine the structure and features of the UCI Indoor Localization WiFi Dataset (520 RSSI features, along with building, floor, and coordinate information).

In [80]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

In [46]:
# Load the dataset
train_df = pd.read_csv('trainingData.csv')
test_df = pd.read_csv('validationData.csv')

# Display the first few rows of the dataset
print("Training dataset: ")
print(train_df.head())

print("\nTest dataset: ")
print(test_df.head())


Training dataset: 
   WAP001  WAP002  WAP003  WAP004  WAP005  WAP006  WAP007  WAP008  WAP009  \
0     100     100     100     100     100     100     100     100     100   
1     100     100     100     100     100     100     100     100     100   
2     100     100     100     100     100     100     100     -97     100   
3     100     100     100     100     100     100     100     100     100   
4     100     100     100     100     100     100     100     100     100   

   WAP010  ...  WAP520  LONGITUDE      LATITUDE  FLOOR  BUILDINGID  SPACEID  \
0     100  ...     100 -7541.2643  4.864921e+06      2           1      106   
1     100  ...     100 -7536.6212  4.864934e+06      2           1      106   
2     100  ...     100 -7519.1524  4.864950e+06      2           1      103   
3     100  ...     100 -7524.5704  4.864934e+06      2           1      102   
4     100  ...     100 -7632.1436  4.864982e+06      0           0      122   

   RELATIVEPOSITION  USERID  PHONEID   TIME

In [48]:
# Display the dataset's structure and basic information
print("\nTrain Dataset Info:")
train_df.info()

print("\nTest Dataset Info:")
test_df.info()



Train Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19937 entries, 0 to 19936
Columns: 529 entries, WAP001 to TIMESTAMP
dtypes: float64(2), int64(527)
memory usage: 80.5 MB

Test Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1111 entries, 0 to 1110
Columns: 529 entries, WAP001 to TIMESTAMP
dtypes: float64(2), int64(527)
memory usage: 4.5 MB
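
Since the 529 columns mix 520 RSSI features with 9 metadata columns, it can help to separate the two groups by name before any further analysis. The following is a minimal sketch on a toy frame; `toy_df`, `wap_cols`, and `meta_cols` are illustrative names that do not appear in the notebook, and the selection assumes every RSSI column starts with the prefix `WAP`, as the column listing suggests.

```python
import pandas as pd

# Toy stand-in for train_df: 3 WAP columns plus 2 metadata columns.
# (The real frame has 520 WAP columns and 9 metadata columns.)
toy_df = pd.DataFrame({
    "WAP001": [100, -97, 100],
    "WAP002": [-60, 100, -75],
    "WAP003": [100, 100, 100],
    "FLOOR": [2, 2, 0],
    "BUILDINGID": [1, 1, 0],
})

# Assumption: RSSI features are exactly the columns whose name starts with "WAP".
wap_cols = [c for c in toy_df.columns if c.startswith("WAP")]
meta_cols = [c for c in toy_df.columns if c not in wap_cols]

print(len(wap_cols), "RSSI columns;", len(meta_cols), "metadata columns")
```

Applied to `train_df`, the same two list comprehensions would give the feature and metadata column lists used in later steps.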


In [124]:
# Select the last 9 columns (the non-WAP metadata columns)
last_9_columns = train_df.iloc[:, -9:]

# Summarize the last 9 columns
print(last_9_columns.describe())

# Show summary statistics of the dataset
print("\n-------Train Summary Statistics-------")
print(train_df.describe())
print("\n-------Test Summary Statistics-------")
print(test_df.describe())



          LONGITUDE      LATITUDE         FLOOR    BUILDINGID       SPACEID  \
count  19937.000000  1.993700e+04  19937.000000  19937.000000  19937.000000   
mean   -7464.275947  4.864871e+06      1.674575      1.212820    148.429954   
std      123.402010  6.693318e+01      1.223078      0.833139     58.342106   
min    -7691.338400  4.864746e+06      0.000000      0.000000      1.000000   
25%    -7594.737000  4.864821e+06      1.000000      0.000000    110.000000   
50%    -7423.060900  4.864852e+06      2.000000      1.000000    129.000000   
75%    -7359.193000  4.864930e+06      3.000000      2.000000    207.000000   
max    -7300.818990  4.865017e+06      4.000000      2.000000    254.000000   

       RELATIVEPOSITION        USERID       PHONEID     TIMESTAMP  
count      19937.000000  19937.000000  19937.000000  1.993700e+04  
mean           1.833024      9.068014     13.021869  1.371421e+09  
std            0.372964      4.988720      5.362410  5.572054e+05  
min            1

In [132]:

def get_outlier_values(df, column):
    """
    Count the outliers in a specific column using the IQR method.

    Parameters:
    - df: The DataFrame containing the data.
    - column: The column name (string) for which we need to count the outliers.

    Returns:
    - The number of values in the column that fall outside
      [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
    """
    # Check if the column exists in the DataFrame
    if column not in df.columns:
        raise ValueError(f"Column '{column}' not found in the DataFrame")

    # Calculate Q1 (25th percentile) and Q3 (75th percentile)
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)

    # Calculate IQR
    IQR = Q3 - Q1

    # Define the lower and upper bounds for outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Select the values outside the bounds and return how many there are
    outliers = df[column][(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers.count()

# Example usage
print("Outliers in each column:")
for i in last_9_columns.columns:
    outliers = get_outlier_values(train_df, i)
    print(f"{i}: {outliers}")


Outliers in each column:
LONGITUDE: 0
LATITUDE: 0
FLOOR: 0
BUILDINGID: 0
SPACEID: 0
RELATIVEPOSITION: 3329
USERID: 0
PHONEID: 437
TIMESTAMP: 1502
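
The IQR rule is designed for continuous measurements, so the counts for code-like columns should be read with care. When most rows share one value, Q1 and Q3 coincide, the IQR is zero, and every other value is flagged. The 3329 flagged RELATIVEPOSITION rows fit that pattern: if the column only takes the values 1 and 2, its mean of 1.833 implies that about 16.7% of the 19937 rows hold the minority code. The sketch below reproduces the effect on toy data and counts outliers for all columns at once; `count_iqr_outliers` and the toy columns are hypothetical names, not part of the notebook.

```python
import pandas as pd

def count_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for every column at once."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Comparisons broadcast the per-column bounds across all rows.
    outside = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return outside.sum()

toy = pd.DataFrame({
    "continuous": [1, 2, 3, 4, 100],   # one genuinely extreme value
    "constant": [5, 5, 5, 5, 5],       # nothing can be flagged
    "binary_code": [2, 2, 2, 2, 1],    # IQR is 0, so the single 1 is flagged
})

counts = count_iqr_outliers(toy)
print(counts)
```

For identifier columns such as USERID or PHONEID, a frequency table from `value_counts()` is usually more informative than an outlier count.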


In [130]:
# Check the column names to understand the features
print("\nColumns in Train Dataset:")
print(train_df.columns)
print("\nColumns in Test Dataset:")
print(test_df.columns)



Columns in Train Dataset:
Index(['WAP001', 'WAP002', 'WAP003', 'WAP004', 'WAP005', 'WAP006', 'WAP007',
       'WAP008', 'WAP009', 'WAP010',
       ...
       'WAP520', 'LONGITUDE', 'LATITUDE', 'FLOOR', 'BUILDINGID', 'SPACEID',
       'RELATIVEPOSITION', 'USERID', 'PHONEID', 'TIMESTAMP'],
      dtype='object', length=529)

Columns in Test Dataset:
Index(['WAP001', 'WAP002', 'WAP003', 'WAP004', 'WAP005', 'WAP006', 'WAP007',
       'WAP008', 'WAP009', 'WAP010',
       ...
       'WAP520', 'LONGITUDE', 'LATITUDE', 'FLOOR', 'BUILDINGID', 'SPACEID',
       'RELATIVEPOSITION', 'USERID', 'PHONEID', 'TIMESTAMP'],
      dtype='object', length=529)


## <b>2. Data Cleaning and Preparation</b>

- Handle any missing values and normalize the RSSI measurements

In [78]:
# checking data types
# print(train_df.dtypes)

# Check for missing values
print("\n-------Train Missing Values-------")
print(train_df.isnull().sum())

print("\n-------Test Missing Values-------")
print(test_df.isnull().sum())

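# Fill missing values with the training-set column means.
# The same training means are used for the test set so that
# no test-set statistics leak into preprocessing.
train_means = train_df.mean()
train_df.fillna(train_means, inplace=True)
test_df.fillna(train_means, inplace=True)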



-------Train Missing Values-------
WAP001              0
WAP002              0
WAP003              0
WAP004              0
WAP005              0
                   ..
SPACEID             0
RELATIVEPOSITION    0
USERID              0
PHONEID             0
TIMESTAMP           0
Length: 529, dtype: int64

-------Test Missing Values-------
WAP001              0
WAP002              0
WAP003              0
WAP004              0
WAP005              0
                   ..
SPACEID             0
RELATIVEPOSITION    0
USERID              0
PHONEID             0
TIMESTAMP           0
Length: 529, dtype: int64
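
Although no NaN values appear above, the `head()` output shows most WAP cells holding 100. In this dataset 100 is a placeholder for an access point that was not detected, not a real signal strength, so it acts as a hidden missing value and would distort any scaling. The sketch below shows one possible way to handle it before normalizing with `MinMaxScaler`. The replacement value of -105 dBm is an assumption (a value just below the weakest real readings), and `toy_train` and `toy_test` are illustrative frames rather than the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

NOT_DETECTED = 100   # placeholder seen in the WAP columns
WEAK_SIGNAL = -105   # assumed replacement, weaker than any real reading

# Toy stand-ins for the WAP columns of train_df and test_df.
toy_train = pd.DataFrame({"WAP001": [100, -97, -60], "WAP002": [-80, 100, -40]})
toy_test = pd.DataFrame({"WAP001": [-70, 100], "WAP002": [100, -50]})

# Map "not detected" to a very weak signal so the ordering of values is meaningful.
train_rssi = toy_train.replace(NOT_DETECTED, WEAK_SIGNAL)
test_rssi = toy_test.replace(NOT_DETECTED, WEAK_SIGNAL)

# Fit the scaler on training data only, then reuse it for the test data.
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_rssi)
test_scaled = scaler.transform(test_rssi)

print(np.round(train_scaled, 3))
print(np.round(test_scaled, 3))
```

`MinMaxScaler` scales each column independently. If the relative strength between access points should be preserved, scaling all WAP columns with one shared minimum and maximum is an alternative worth comparing.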


- Scale features appropriately and encode any categorical variables if needed

- Optionally, apply feature reduction techniques (e.g., PCA) to reduce noise and improve model efficiency
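
The optional PCA step could look like the following sketch. It uses synthetic data in place of the scaled RSSI matrix, and the 95% variance threshold is an assumed setting chosen for illustration, not a value taken from the project. The names `X_train`, `X_test`, and `mixing` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for scaled RSSI features:
# 20 observed features driven by 3 latent factors plus a little noise.
mixing = rng.normal(size=(3, 20))
X_train = rng.normal(size=(200, 3)) @ mixing + 0.01 * rng.normal(size=(200, 20))
X_test = rng.normal(size=(50, 3)) @ mixing

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95, svd_solver="full")
X_train_reduced = pca.fit_transform(X_train)  # fit on training data only
X_test_reduced = pca.transform(X_test)        # reuse the fitted projection

print("Components kept:", pca.n_components_)
print("Explained variance:", pca.explained_variance_ratio_.sum())
```

As with scaling, the projection is fitted on the training set and only applied to the test set, which keeps the evaluation free of test-set information.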