**Title**

**Introduction**

- Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal
- Clearly state the question you will try to answer with your project
- Identify and describe the dataset that will be used to answer the question

In [24]:
# set up the environment
import random

import altair as alt
import pandas as pd
import numpy as np
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")

**Preliminary exploratory data analysis:**

- Demonstrate that the dataset can be read from the web into Python.

The dataset was obtained from Kaggle in .csv format, and a copy is stored as a data file for this project. It can be read directly with pd.read_csv.



In [26]:
sonar_data = pd.read_csv ("https://public.dm.files.1drv.com/y4m2gk6aVK0hoCkEA5bOc3KJoELsCkWSPnOYkDd7Kuv2Zm3DlFKsKh5ZsTm1-B1yHNxzHykiyiQdcYF9XhVHZdWzDP8a99bZEeo2HvzX9tDr44xkO690S5B_iFmctI03A8JdF8O7DlNS7ZQN_60TaXIMTJRt0eOekI-rNab-4dvT75LpATYpu7HGorVscmYZxUFCwhO-rYiw0XXNbJc1nld4lpL-v22UvYMGFFrqOdpTto")
sonar_data

Unnamed: 0,Freq_1,Freq_2,Freq_3,Freq_4,Freq_5,Freq_6,Freq_7,Freq_8,Freq_9,Freq_10,...,Freq_52,Freq_53,Freq_54,Freq_55,Freq_56,Freq_57,Freq_58,Freq_59,Freq_60,Label
0,0.0200,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,...,0.0027,0.0065,0.0159,0.0072,0.0167,0.0180,0.0084,0.0090,0.0032,R
1,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0084,0.0089,0.0048,0.0094,0.0191,0.0140,0.0049,0.0052,0.0044,R
2,0.0262,0.0582,0.1099,0.1083,0.0974,0.2280,0.2431,0.3771,0.5598,0.6194,...,0.0232,0.0166,0.0095,0.0180,0.0244,0.0316,0.0164,0.0095,0.0078,R
3,0.0100,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,...,0.0121,0.0036,0.0150,0.0085,0.0073,0.0050,0.0044,0.0040,0.0117,R
4,0.0762,0.0666,0.0481,0.0394,0.0590,0.0649,0.1209,0.2467,0.3564,0.4459,...,0.0031,0.0054,0.0105,0.0110,0.0015,0.0072,0.0048,0.0107,0.0094,R
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
203,0.0187,0.0346,0.0168,0.0177,0.0393,0.1630,0.2028,0.1694,0.2328,0.2684,...,0.0116,0.0098,0.0199,0.0033,0.0101,0.0065,0.0115,0.0193,0.0157,M
204,0.0323,0.0101,0.0298,0.0564,0.0760,0.0958,0.0990,0.1018,0.1030,0.2154,...,0.0061,0.0093,0.0135,0.0063,0.0063,0.0034,0.0032,0.0062,0.0067,M
205,0.0522,0.0437,0.0180,0.0292,0.0351,0.1171,0.1257,0.1178,0.1258,0.2529,...,0.0160,0.0029,0.0051,0.0062,0.0089,0.0140,0.0138,0.0077,0.0031,M
206,0.0303,0.0353,0.0490,0.0608,0.0167,0.1354,0.1465,0.1123,0.1945,0.2354,...,0.0086,0.0046,0.0126,0.0036,0.0035,0.0034,0.0079,0.0036,0.0048,M


- Clean and wrangle your data into a tidy format

The data are already in tidy format: each column is a variable, each row is one observation, and each cell holds a single value.

- Using only training data, summarize the data in at least one table (this is exploratory data analysis).
  
  An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data.

First we take a look at the number of rows. 

In [27]:
nb_observations = sonar_data.shape[0]
nb_observations

208
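The template above also asks how many rows have missing data. A minimal sketch of that check follows, using a small made-up frame (`demo`, a hypothetical stand-in for sonar_data) so that it runs on its own; the same two calls could be applied to sonar_data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sonar_data: three rows, one with a missing value
demo = pd.DataFrame({
    "Freq_1": [0.0200, np.nan, 0.0262],
    "Freq_2": [0.0371, 0.0523, 0.0582],
    "Label": ["R", "R", "M"],
})

# Number of rows that contain at least one missing value
rows_with_missing = int(demo.isna().any(axis=1).sum())

# Number of missing values in each column
missing_per_column = demo.isna().sum()

print(rows_with_missing)
print(missing_per_column)
```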

Next, we count the observations in each class to check whether the data are balanced. With imbalanced classes, the majority class tends to win the vote among neighbours even when it should not, which biases the classifier toward that class and limits its performance.

In [44]:
sample_breakdown = sonar_data["Label"].value_counts()
sample_breakdown

M    111
R     97
Name: Label, dtype: int64

Since the two classes have roughly the same number of observations (111 mines and 97 rocks), we treat the dataset as balanced.
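The template asks for the exploratory analysis to use only training data, while the cells here summarize the full dataset. One possible way to set this up is a stratified split, which keeps the class ratio similar in the training and test parts. The sketch below uses randomly generated stand-in data (`demo`, hypothetical) with the same class counts as above; the 25% test size and the random_state value are assumptions, not choices made in this proposal.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same class counts as the table above (111 M, 97 R)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Freq_1": rng.random(208),
    "Freq_2": rng.random(208),
    "Label": ["M"] * 111 + ["R"] * 97,
})

# Hold out 25% of the rows for testing; stratify keeps the M/R ratio
# similar in the training and test parts
train_df, test_df = train_test_split(
    demo, test_size=0.25, stratify=demo["Label"], random_state=123
)

print(train_df["Label"].value_counts())
print(test_df["Label"].value_counts())
```

Summaries and plots would then be computed on train_df only.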

- Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis).

    An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.

Next, I split the data into two groups, Mine and Rock, to see whether their sonar profiles differ visibly. I first group the dataset by label and take the mean of each frequency. I then plot both mean profiles as bar charts and overlay them to compare the sonar intensity at each frequency.

In [48]:
snoar_data_agg = sonar_data.groupby(["Label"]).mean().reset_index()
snoar_data_agg

Unnamed: 0,Label,Freq_1,Freq_2,Freq_3,Freq_4,Freq_5,Freq_6,Freq_7,Freq_8,Freq_9,...,Freq_51,Freq_52,Freq_53,Freq_54,Freq_55,Freq_56,Freq_57,Freq_58,Freq_59,Freq_60
0,M,0.034989,0.045544,0.05072,0.064768,0.086715,0.111864,0.128359,0.149832,0.213492,...,0.019352,0.016014,0.011643,0.012185,0.009923,0.008914,0.007825,0.00906,0.008695,0.00693
1,R,0.022498,0.030303,0.035951,0.041447,0.062028,0.096224,0.11418,0.117596,0.137392,...,0.012311,0.010453,0.00964,0.009518,0.008567,0.00743,0.007814,0.006677,0.007078,0.006024


The mean profile of the mines differs from that of the rocks. For example, the mean intensity at Freq_9 is noticeably higher for mines (0.213) than for rocks (0.137). To visualize the difference, I will make a bar plot of both aggregates and overlay them to compare intensity at each frequency.
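The comparison can also be made numerically by subtracting the rock mean from the mine mean for each frequency. The sketch below copies a few class means from the aggregate table above into a small frame (`means`, a hypothetical name); it covers only those four columns, so it illustrates the calculation rather than ranking all 60 frequencies.

```python
import pandas as pd

# Class means copied from the aggregate table above (subset of columns only)
means = pd.DataFrame(
    {
        "Freq_1": [0.034989, 0.022498],
        "Freq_2": [0.045544, 0.030303],
        "Freq_3": [0.050720, 0.035951],
        "Freq_9": [0.213492, 0.137392],
    },
    index=["M", "R"],
)

# Mine mean minus rock mean for each frequency, largest gap first
gap = (means.loc["M"] - means.loc["R"]).sort_values(ascending=False)
print(gap)
```

Among these four columns, Freq_9 shows the largest gap (about 0.076).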

In [69]:
# sort columns
sorted_cols = ['Label'] + sorted(snoar_data_agg.columns[1:], key=lambda x: int(x.split('_')[1]))

# melt data into long format
sonar_agg_melt = snoar_data_agg.melt(id_vars = "Label",
                                     var_name="Frequency",
                                     value_name = "Intensity"
                                    )
sonar_agg_melt

Unnamed: 0,Label,Frequency,Intensity
0,M,Freq_1,0.034989
1,R,Freq_1,0.022498
2,M,Freq_2,0.045544
3,R,Freq_2,0.030303
4,M,Freq_3,0.050720
...,...,...,...
115,R,Freq_58,0.006677
116,M,Freq_59,0.008695
117,R,Freq_59,0.007078
118,M,Freq_60,0.006930


In [71]:
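# Overlay the two mean profiles: stack=None stops Altair from stacking the
# classes on top of each other, and partial opacity keeps both sets of bars visible
agg_bar_plot = alt.Chart(sonar_agg_melt).mark_bar(opacity=0.6).encode(
    x = alt.X("Frequency", sort=sorted_cols[1:]),
    y = alt.Y("Intensity", stack=None),
    color = "Label"
)

agg_bar_plot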

**Methods**

- Explain how you will conduct your data analysis and which variables/columns you will use. Note - you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable think: is this a useful variable for prediction?
- Describe at least one way that you will visualize the results
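
The imports in the setup cell (StandardScaler, make_column_transformer, make_pipeline, KNeighborsClassifier) suggest a K-nearest neighbours classifier on standardized predictors. The sketch below shows what such a pipeline might look like. It is an illustration under stated assumptions, not the settled method of this proposal: the data are randomly generated (class means loosely based on the aggregate table, spreads invented), and the choice of Freq_8 and Freq_9 as predictors, n_neighbors=5, and the 25% test split are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: class means loosely follow the aggregate table,
# the standard deviation of 0.05 is invented for illustration
rng = np.random.default_rng(0)
n = 100
demo = pd.DataFrame({
    "Freq_8": np.concatenate([rng.normal(0.15, 0.05, n), rng.normal(0.12, 0.05, n)]),
    "Freq_9": np.concatenate([rng.normal(0.21, 0.05, n), rng.normal(0.14, 0.05, n)]),
    "Label": ["M"] * n + ["R"] * n,
})
predictors = ["Freq_8", "Freq_9"]  # assumed choice of predictors

train_df, test_df = train_test_split(
    demo, test_size=0.25, stratify=demo["Label"], random_state=123
)

# Standardize the predictors, then classify with K-nearest neighbours
preprocessor = make_column_transformer((StandardScaler(), predictors))
knn_pipeline = make_pipeline(preprocessor, KNeighborsClassifier(n_neighbors=5))
knn_pipeline.fit(train_df[predictors], train_df["Label"])

# Accuracy on the held-out rows
accuracy = knn_pipeline.score(test_df[predictors], test_df["Label"])
predictions = knn_pipeline.predict(test_df[predictors])
print(accuracy)
```

In practice the number of neighbours would be chosen by cross-validation on the training data rather than fixed in advance.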

**Expected outcomes and significance:**
- What do you expect to find?

  We expect our classifier to predict, from an object's sonar profile, whether the object is a mine or a rock.

- What impact could such findings have?

  Such findings matter because rocks carry information about the landscape and geological history of an area, and our daily lives depend on minerals and metals.

- What future questions could this lead to?

  This could lead to future questions such as how large the rock or mine is and what type of rock or mine it is.