# Cosine Similarity
This metric measures the cosine of the angle between two non-zero vectors in a multi-dimensional space. It is particularly useful for interval- and ratio-scaled attributes when the direction of the vectors matters more than their magnitude.
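
For reference, the standard definition can be written out as follows. The symbols $a$, $b$, $n$ and $\theta$ are introduced here for illustration and do not appear in the notebook: $a$ and $b$ are two non-zero vectors with $n$ components, and $\theta$ is the angle between them.

$$
\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
= \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}
$$

SciPy's `cosine` function returns the cosine distance, $1 - \cos\theta$, rather than the similarity itself.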

### Importing required Libraries

In [3]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from scipy.spatial.distance import cosine

### Load Datasets

In [4]:
# Load datasets
adult_df = pd.read_csv("../adult/adult_trim.data", header=None) # No header
titanic_df = pd.read_csv('../titanic/titanic_trim.csv') # Has header

# Rename columns for clarity
adult_df.columns = ["age", "workclass", "fnlwgt", "education", "education_num", 
                    "marital_status", "occupation", "relationship", "race", "sex", 
                    "capital_gain", "capital_loss", "hours_per_week", "native_country", "income"]
adult_df.dropna(inplace=True)

In [5]:
adult_df

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,29,Local-gov,115585,Some-college,10,Never-married,Handlers-cleaners,Not-in-family,White,Male,0,0,50,United-States,<=50K
96,48,Self-emp-not-inc,191277,Doctorate,16,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,1902,60,United-States,>50K
97,37,Private,202683,Some-college,10,Married-civ-spouse,Sales,Husband,White,Male,0,0,48,United-States,>50K
98,48,Private,171095,Assoc-acdm,12,Divorced,Exec-managerial,Unmarried,White,Female,0,0,40,England,<=50K


In [6]:
titanic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
150,151,0,2,"Bateman, Rev. Robert James",male,51.0,0,0,S.O.P. 1166,12.5250,,S
151,152,1,1,"Pears, Mrs. Thomas (Edith Wearne)",female,22.0,1,0,113776,66.6000,C2,S
152,153,0,3,"Meo, Mr. Alfonzo",male,55.5,0,0,A.5. 11206,8.0500,,S
153,154,0,3,"van Billiard, Mr. Austin Blyler",male,40.5,0,2,A/5. 851,14.5000,,S


### Select relevant columns from Adult dataset (mix of nominal and ratio-scaled)

In [7]:
adult_df = adult_df[["age", "workclass", "education", "education_num", "sex"]]

adult_df

Unnamed: 0,age,workclass,education,education_num,sex
0,39,State-gov,Bachelors,13,Male
1,50,Self-emp-not-inc,Bachelors,13,Male
2,38,Private,HS-grad,9,Male
3,53,Private,11th,7,Male
4,28,Private,Bachelors,13,Female
...,...,...,...,...,...
95,29,Local-gov,Some-college,10,Male
96,48,Self-emp-not-inc,Doctorate,16,Male
97,37,Private,Some-college,10,Male
98,48,Private,Assoc-acdm,12,Female


### Encode nominal attributes as integers for processing

In [8]:
label_encoders = {}
adult_df = adult_df.copy()  # work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning
for column in adult_df.columns:
    if adult_df[column].dtype == object:
        le = LabelEncoder()
        adult_df[column] = le.fit_transform(adult_df[column])
        label_encoders[column] = le  # keep the fitted encoder so codes can be mapped back to labels

adult_df

Unnamed: 0,age,workclass,education,education_num,sex
0,39,6,7,13,1
1,50,5,7,13,1
2,38,3,9,9,1
3,53,3,1,7,1
4,28,3,7,13,0
...,...,...,...,...,...
95,29,2,12,10,1
96,48,5,8,16,1
97,37,3,12,10,1
98,48,3,5,12,0
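
To see where these integer codes come from: `LabelEncoder` numbers the classes in sorted order, which is why `Female` becomes 0 and `Male` becomes 1 above. The sketch below reproduces that kind of mapping with NumPy's `np.unique` on a few hand-picked `workclass` values. The variable names are illustrative, and the codes differ from the notebook's because only three classes are present here.

```python
import numpy as np

# A few workclass labels taken from the table above (illustrative subset)
workclass = np.array(["State-gov", "Private", "Local-gov", "Private"])

# np.unique returns the sorted distinct labels and, with return_inverse=True,
# the position of each original value in that sorted array. LabelEncoder
# assigns its integer codes the same way: by position in the sorted class list.
classes, codes = np.unique(workclass, return_inverse=True)

print(classes)  # sorted labels
print(codes)    # integer code for each original value

# Decoding is an index lookup, analogous to LabelEncoder.inverse_transform
decoded = classes[codes]
```

Because the codes depend only on alphabetical order, the numeric distance between two codes says nothing about how similar the underlying categories are.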


### Clean and preprocess Titanic dataset

In [9]:
titanic_df.dropna(inplace=True)
titanic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
21,22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0,D56,S
23,24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5,A6,S
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
52,53,1,1,"Harper, Mrs. Henry Sleeper (Myna Haxtun)",female,49.0,1,0,PC 17572,76.7292,D33,C
54,55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C


### Select relevant columns from Titanic dataset (mix of nominal and ratio-scaled)

In [10]:
titanic_df = titanic_df[["Age", "Sex", "Pclass", "Fare", "Embarked"]]
titanic_df

Unnamed: 0,Age,Sex,Pclass,Fare,Embarked
1,38.0,female,1,71.2833,C
3,35.0,female,1,53.1,S
6,54.0,male,1,51.8625,S
10,4.0,female,3,16.7,S
11,58.0,female,1,26.55,S
21,34.0,male,2,13.0,S
23,28.0,male,1,35.5,S
27,19.0,male,1,263.0,S
52,49.0,female,1,76.7292,C
54,65.0,male,1,61.9792,C


### Encode nominal attributes as integers for processing

In [11]:
label_encoders_titanic = {}
titanic_df = titanic_df.copy()  # work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning
for column in titanic_df.columns:
    if titanic_df[column].dtype == object:
        le = LabelEncoder()
        titanic_df[column] = le.fit_transform(titanic_df[column])
        label_encoders_titanic[column] = le  # store in the Titanic dictionary, not the Adult one

titanic_df

Unnamed: 0,Age,Sex,Pclass,Fare,Embarked
1,38.0,0,1,71.2833,0
3,35.0,0,1,53.1,1
6,54.0,1,1,51.8625,1
10,4.0,0,3,16.7,1
11,58.0,0,1,26.55,1
21,34.0,1,2,13.0,1
23,28.0,1,1,35.5,1
27,19.0,1,1,263.0,1
52,49.0,0,1,76.7292,0
54,65.0,1,1,61.9792,0


### Combine the datasets into a dictionary for further processing

In [12]:
# Map a display name to each preprocessed DataFrame
datasets = {
    "Adult Dataset": adult_df,
    "Titanic Dataset": titanic_df
}

### Compute Cosine Similarity

In [13]:
def cosine_similarity(a, b):
    """Return the cosine similarity between two vectors (NaN if it cannot be computed)."""
    try:
        return 1 - cosine(a, b)  # SciPy returns the cosine distance, so subtract it from 1
    except Exception:
        return np.nan

# Function to create the Cosine Similarity matrix
def calculate_cosine_similarity_matrix(dataset):
    n = len(dataset)
    cosine_matrix = np.zeros((n, n))
    
    for i in range(n):
        for j in range(n):
            cosine_matrix[i, j] = cosine_similarity(dataset.iloc[i].values, dataset.iloc[j].values)
    
    return pd.DataFrame(cosine_matrix)
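
The double loop above calls SciPy once per pair of rows, which becomes slow as the number of rows grows. As a sketch of an alternative (not the notebook's method), the same values can be obtained by scaling each row to unit length and taking a single matrix product. The function name `cosine_similarity_matrix_vectorized` is hypothetical, and this version assumes no row consists entirely of zeros.

```python
import numpy as np

def cosine_similarity_matrix_vectorized(data):
    """Pairwise cosine similarity of the rows of a 2-D array-like (hypothetical helper)."""
    X = np.asarray(data, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)  # one Euclidean norm per row
    unit_rows = X / norms                             # assumes no all-zero row
    return unit_rows @ unit_rows.T                    # dot products of unit vectors

# Three encoded Adult rows from the table above:
# age, workclass, education, education_num, sex
sample = np.array([
    [39, 6, 7, 13, 1],
    [50, 5, 7, 13, 1],
    [38, 3, 9, 9, 1],
])
sim = cosine_similarity_matrix_vectorized(sample)
print(np.round(sim, 6))
```

Passing `adult_df.values` or `titanic_df.values` and wrapping the result in `pd.DataFrame` would give output in the same shape as the loop-based function.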

### Calculate Cosine Similarity

#### For Adult Dataset

In [14]:
cosine_matrix_adult = calculate_cosine_similarity_matrix(adult_df)
cosine_matrix_adult

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,1.000000,0.996120,0.992260,0.967533,0.991062,0.993478,0.969143,0.987947,0.986347,0.996862,...,0.985347,0.986888,0.975612,0.991905,0.994913,0.976034,0.998883,0.987173,0.991174,0.987921
1,0.996120,1.000000,0.995356,0.985101,0.980434,0.987759,0.986363,0.996085,0.974868,0.998307,...,0.996532,0.985455,0.965694,0.988742,0.991649,0.968522,0.997556,0.985626,0.998577,0.989193
2,0.992260,0.995356,1.000000,0.973431,0.980291,0.991643,0.980816,0.996382,0.980890,0.995442,...,0.991265,0.996296,0.981828,0.996523,0.990753,0.984786,0.993527,0.996606,0.991648,0.997422
3,0.967533,0.985101,0.973431,1.000000,0.934177,0.948206,0.997445,0.987336,0.923486,0.976317,...,0.994705,0.951232,0.913116,0.953271,0.958745,0.919479,0.971956,0.951403,0.990072,0.959821
4,0.991062,0.980434,0.980291,0.934177,1.000000,0.996981,0.936376,0.965545,0.997933,0.988638,...,0.963205,0.983161,0.983478,0.988883,0.995978,0.983085,0.991562,0.982858,0.974700,0.984550
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.976034,0.968522,0.984786,0.919479,0.983085,0.991265,0.933007,0.967070,0.991735,0.976569,...,0.954591,0.994996,0.998900,0.994023,0.982860,1.000000,0.976136,0.995441,0.959385,0.992139
96,0.998883,0.997556,0.993527,0.971956,0.991562,0.994680,0.972707,0.989034,0.986548,0.999319,...,0.989125,0.987346,0.974227,0.991879,0.997575,0.976136,1.000000,0.987524,0.994687,0.990804
97,0.987173,0.985626,0.996606,0.951403,0.982858,0.993749,0.962654,0.986846,0.987841,0.988844,...,0.977234,0.999692,0.993583,0.998922,0.988614,0.995441,0.987524,1.000000,0.979044,0.998150
98,0.991174,0.998577,0.991648,0.990072,0.974700,0.982658,0.989681,0.994338,0.967490,0.996608,...,0.998355,0.979346,0.955547,0.982578,0.988623,0.959385,0.994687,0.979044,1.000000,0.985352


#### For Titanic Dataset

In [15]:
cosine_matrix_titanic = calculate_cosine_similarity_matrix(titanic_df)
cosine_matrix_titanic

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,17,18,19,20,21,22,23,24,25,26
0,1.0,0.99555,0.950379,0.953797,0.794964,0.753466,0.983664,0.914004,0.996915,0.949365,...,0.974767,0.969902,0.923689,0.763595,0.992686,0.990109,0.99271,0.980876,0.894276,0.985364
1,0.99555,1.0,0.975197,0.927993,0.847999,0.811493,0.996103,0.872311,0.999761,0.974291,...,0.949966,0.988348,0.883987,0.820538,0.999551,0.998842,0.999548,0.958475,0.932092,0.965439
2,0.950379,0.975197,1.0,0.830554,0.944041,0.920198,0.990421,0.742794,0.971814,0.999899,...,0.857635,0.997505,0.758953,0.925919,0.980901,0.98353,0.980661,0.872046,0.988775,0.88392
3,0.953797,0.927993,0.830554,1.0,0.609816,0.565279,0.897217,0.97128,0.931194,0.827611,...,0.985645,0.866148,0.974731,0.578173,0.917484,0.915496,0.918877,0.983677,0.749304,0.983066
4,0.794964,0.847999,0.944041,0.609816,1.0,0.99674,0.889719,0.480649,0.840094,0.945111,...,0.640159,0.918423,0.501963,0.997874,0.861941,0.869931,0.861284,0.661958,0.98141,0.680559
5,0.753466,0.811493,0.920198,0.565279,0.99674,1.0,0.858714,0.422964,0.802464,0.921172,...,0.589457,0.890551,0.444933,0.999505,0.826884,0.836496,0.826546,0.612193,0.968018,0.631707
6,0.983664,0.996103,0.990421,0.897217,0.889719,0.858714,1.0,0.827375,0.994507,0.98967,...,0.920013,0.997659,0.840815,0.866089,0.998122,0.998803,0.99818,0.930736,0.959776,0.939538
7,0.914004,0.872311,0.742794,0.97128,0.480649,0.422964,0.827375,1.0,0.879364,0.740421,...,0.981288,0.788166,0.999692,0.436753,0.858832,0.849936,0.859375,0.975362,0.636737,0.969558
8,0.996915,0.999761,0.971814,0.931194,0.840094,0.802464,0.994507,0.879364,1.0,0.971077,...,0.954297,0.985949,0.890788,0.81166,0.999026,0.997672,0.998942,0.962606,0.926405,0.968991
9,0.949365,0.974291,0.999899,0.827611,0.945111,0.921172,0.98967,0.740421,0.971077,1.0,...,0.855681,0.997135,0.756698,0.926885,0.980147,0.982543,0.979817,0.870352,0.988988,0.882138


### Explanation
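Cosine Similarity Calculation: This metric depends only on the orientation of the data points in a multi-dimensional space. A value of 1 means that the vectors point in the same direction (they need not be identical, since any positive multiple of a vector has similarity 1 with it), 0 indicates orthogonal vectors (no similarity), and -1 indicates vectors pointing in opposite directions. Because every encoded attribute in both datasets is non-negative, all values in the matrices above fall between 0 and 1.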
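
These three reference values can be checked on small hand-made vectors. The helper `cos_sim` below is an illustrative NumPy implementation of the definition, not the notebook's SciPy-based function.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between two non-zero vectors (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cos_sim([1, 2, 3], [2, 4, 6])   # second vector is a positive multiple of the first
orthogonal = cos_sim([1, 0], [0, 5])             # dot product is zero
opposite = cos_sim([1, 2, 3], [-1, -2, -3])      # second vector is a negative multiple of the first

print(same_direction, orthogonal, opposite)
```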


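Handling Different Data Types: Cosine similarity works best with interval and ratio-scaled data, as it considers the direction (or relative distribution) rather than the magnitude of the data points. The nominal attributes here were label-encoded, so their integer codes carry an arbitrary order, and their contribution to the similarity should be interpreted with caution.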
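
One caveat for nominal columns: integer codes from label encoding impose an order that the categories do not have, so relabeling the categories changes the similarity. The sketch below illustrates this with two made-up records and shows one-hot (indicator) encoding as a common alternative. The records and the helper `cos_sim` are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between two non-zero vectors (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two made-up records with a Pclass value and a nominal Embarked value.
# Record 1 embarked at "C", record 2 at "S".
pclass_1, pclass_2 = 1, 3

# Label encoding A: C -> 0, S -> 1 (sorted order)
sim_codes_a = cos_sim([pclass_1, 0], [pclass_2, 1])
# Label encoding B: C -> 1, S -> 0 (same data, labels swapped)
sim_codes_b = cos_sim([pclass_1, 1], [pclass_2, 0])

# One-hot encoding: one indicator column per category, so no order is implied.
# Columns: Pclass, Embarked_C, Embarked_S
sim_one_hot = cos_sim([pclass_1, 1, 0], [pclass_2, 0, 1])

print(sim_codes_a, sim_codes_b, sim_one_hot)
```

The two label encodings describe identical data yet give different similarities, whereas the one-hot result does not depend on how the categories are numbered.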

### Observation and Analysis
The resulting matrices show the similarity between every pair of records based on the angle between their corresponding vectors. In the displayed portion of the output, the Adult values all lie above 0.91, while the Titanic values range down to about 0.42. This view is especially useful in text mining and other applications where the magnitude of features is less important than their relative weight.
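
As an illustration of the text-mining case, consider term-count vectors for documents of different lengths. The vocabulary and counts below are invented for the example, and `cos_sim` is an illustrative helper.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between two non-zero vectors (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Counts of the terms ["data", "model", "ship"] in three made-up documents
short_doc = np.array([2, 1, 0])
long_doc = np.array([20, 10, 0])   # ten times longer, same term proportions
other_doc = np.array([0, 1, 5])    # mostly about a different term

sim_same_topic = cos_sim(short_doc, long_doc)
sim_other_topic = cos_sim(short_doc, other_doc)

# Euclidean distance, by contrast, is driven by document length
dist_same_topic = float(np.linalg.norm(short_doc - long_doc))
dist_other_topic = float(np.linalg.norm(short_doc - other_doc))

print(sim_same_topic, sim_other_topic, dist_same_topic, dist_other_topic)
```

Here the short and long documents are far apart in Euclidean terms but have the highest possible cosine similarity, because only their term proportions are compared.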


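Cosine similarity is unaffected by multiplying an entire vector by a positive constant, which makes it useful for sparse data and for comparing records of different overall magnitude. It is not invariant to the scale of individual attributes, however: columns with large values, such as `age` or `Fare`, dominate the dot product, so attributes should be normalized to comparable ranges first if each is meant to contribute equally.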
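
The effect of attribute scale can be seen on two encoded Titanic rows from the table above (index 10 and index 27). Dividing `Fare` by 100 is an arbitrary choice made for illustration, and `cos_sim` is an illustrative helper rather than the notebook's function.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between two non-zero vectors (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Encoded Titanic rows from the table above: Age, Sex, Pclass, Fare, Embarked
row_10 = np.array([4.0, 0, 3, 16.7, 1])     # young child, third class, low fare
row_27 = np.array([19.0, 1, 1, 263.0, 1])   # first class, high fare

sim_raw = cos_sim(row_10, row_27)

# Multiplying a whole vector by a positive constant leaves the similarity unchanged
sim_uniform = cos_sim(3 * row_10, row_27)

# Rescaling a single attribute (Fare / 100, an arbitrary choice) changes it
fare_scale = np.array([1, 1, 1, 0.01, 1])
sim_rescaled = cos_sim(row_10 * fare_scale, row_27 * fare_scale)

print(round(sim_raw, 5), round(sim_uniform, 5), round(sim_rescaled, 5))
```

With the raw values the two rows score about 0.97, matching the corresponding entry in the Titanic matrix, largely because `Fare` is the largest component of both vectors. After `Fare` is rescaled, the similarity drops to roughly 0.82.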