## Objective: Learning to implement a content-based recommender system (CBRS) using a TF-IDF matrix
## Problem
### 1. Create the TF-IDF matrix for the music recommender example data given below:
#### sg1='drums,guitar,beat'
#### sg2='drums,guitar,orchestra'
#### sg3='guitar,beat'
#### sg4='classical,symphony,orchestra'
#### sg5='guitar,classical,orchestra'
#### sg6='classical,symphony'
#### with ratings 0, 1, 0, 1, 0, 0 respectively (1 = Like, 0 = Dislike)

### 2. Reduce the dimension using PCA

### 3. Carry out inline (in-sample) prediction using logistic regression

### Step 1: Create documents

In [2]:
# sg1 here repeats 'guitar' and adds 'beats', so it differs from sg1 in the problem statement
sg1='drums,guitar,beat,guitar,guitar,beats'
sg2='drums,guitar,orchestra'
sg3='guitar,beat'
sg4='classical,symphony,orchestra'
sg5='guitar,classical,orchestra'
sg6='classical,symphony'


### Step 2: Merge the documents to create a single corpus

In [4]:
corpus=[sg1,sg2,sg3,sg4,sg5,sg6]
corpus


['drums,guitar,beat,guitar,guitar,beats',
 'drums,guitar,orchestra',
 'guitar,beat',
 'classical,symphony,orchestra',
 'guitar,classical,orchestra',
 'classical,symphony']

### Step 3: Import the library

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer()


### Step 4: TF-IDF matrix creation

In [6]:
result=tfidf.fit_transform(corpus)
print(result) # viewing the sparse representation


  (0, 3)	0.34926005140142785
  (0, 4)	0.7580418492355107
  (0, 0)	0.34926005140142785
  (0, 1)	0.42591946163300803
  (1, 3)	0.668719891595794
  (1, 4)	0.48380155055600843
  (1, 5)	0.5645792825314363
  (2, 4)	0.5861569567966913
  (2, 0)	0.8101975203608325
  (3, 5)	0.5420919460564738
  (3, 2)	0.5420919460564738
  (3, 6)	0.642084608164228
  (4, 4)	0.5182242665631911
  (4, 5)	0.6047493735197427
  (4, 2)	0.6047493735197427
  (5, 2)	0.6451024322949592
  (5, 6)	0.764096101185661


In [7]:
#print(tfidf.vocabulary_)
print(result.toarray()) # to see the matrix representation


[[0.34926005 0.42591946 0.         0.34926005 0.75804185 0.
  0.        ]
 [0.         0.         0.         0.66871989 0.48380155 0.56457928
  0.        ]
 [0.81019752 0.         0.         0.         0.58615696 0.
  0.        ]
 [0.         0.         0.54209195 0.         0.         0.54209195
  0.64208461]
 [0.         0.         0.60474937 0.         0.51822427 0.60474937
  0.        ]
 [0.         0.         0.64510243 0.         0.         0.
  0.7640961 ]]
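
To see where these numbers come from, one row can be reproduced by hand. This is a minimal sketch that assumes the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`, raw term counts), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit length. The document frequencies are counted manually from the six songs; `n_docs`, `df` and `tf` are illustrative names.

```python
import numpy as np

n_docs = 6

# Row 2 is sg3 = 'guitar,beat'
# 'beat' occurs in sg1 and sg3; 'guitar' occurs in sg1, sg2, sg3 and sg5
df = {'beat': 2, 'guitar': 4}
tf = {'beat': 1, 'guitar': 1}  # raw counts inside sg3

# Smoothed IDF as used by scikit-learn's default settings
idf = {t: np.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}

# TF * IDF, then L2-normalise the row
raw = np.array([tf[t] * idf[t] for t in ('beat', 'guitar')])
weights = raw / np.linalg.norm(raw)
print(weights)  # compare with columns 0 and 4 of row 2 in the matrix
```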


#### TF-IDF matrix creation for the combined training and test data

In [8]:
t1='classical'
t2='drums,beat'
corpus=[sg1,sg2,sg3,sg4,sg5,sg6,t1,t2]
#corpus


In [9]:
result=tfidf.fit_transform(corpus)
print(result) # sparse representation, now with 8 rows (6 songs + t1, t2)


  (0, 3)	0.3038587352347985
  (0, 4)	0.7992514181378093
  (0, 0)	0.3038587352347985
  (0, 1)	0.42016295487312805
  (1, 3)	0.600978201166947
  (1, 4)	0.5269254249362817
  (1, 5)	0.600978201166947
  (2, 4)	0.6592621372425836
  (2, 0)	0.7519131827533952
  (3, 5)	0.5668934226488676
  (3, 2)	0.49704058656839417
  (3, 6)	0.6569493912480618
  (4, 4)	0.5504130215007531
  (4, 5)	0.627766685580577
  (4, 2)	0.5504130215007531
  (5, 2)	0.603357526680706
  (5, 6)	0.7974708113766555
  (6, 2)	1.0
  (7, 3)	0.7071067811865475
  (7, 0)	0.7071067811865475
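
Refitting the vectorizer on all eight documents changes the IDF weights of the six rated songs, which is why the values here differ from the first matrix. A common alternative, sketched below as an assumption and not as this notebook's method, fits the vectorizer on the rated songs only and then calls `transform` on the new documents. The names `train_docs`, `new_docs`, `X_rated` and `X_new` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    'drums,guitar,beat,guitar,guitar,beats',
    'drums,guitar,orchestra',
    'guitar,beat',
    'classical,symphony,orchestra',
    'guitar,classical,orchestra',
    'classical,symphony',
]
new_docs = ['classical', 'drums,beat']  # t1 and t2

vec = TfidfVectorizer()
X_rated = vec.fit_transform(train_docs)  # vocabulary and IDF learned from rated songs only
X_new = vec.transform(new_docs)          # new documents reuse that vocabulary and IDF

print(X_rated.shape, X_new.shape)
print(X_new.toarray())
```

In this variant the weights of the rated songs stay identical to the first matrix, and any term in a new document that is absent from the learned vocabulary is ignored.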


In [10]:
from sklearn.decomposition import PCA
import numpy as np
X=result.toarray()
pca=PCA()
pca.fit(X)
np.cumsum(pca.explained_variance_ratio_)


array([0.47945672, 0.6845736 , 0.82897142, 0.93613894, 0.99091653,
       0.9994026 , 1.        ])
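
The cumulative ratios can guide the choice of `n_components`. The sketch below uses the values printed above and a hypothetical 90% variance target (the threshold is an assumption, not something the notebook specifies) to find the smallest number of components that reaches it.

```python
import numpy as np

# Cumulative explained variance ratios reported above
cum_var = np.array([0.47945672, 0.6845736, 0.82897142, 0.93613894,
                    0.99091653, 0.9994026, 1.0])

threshold = 0.90  # hypothetical target, chosen for illustration

# Index of the first entry >= threshold, converted to a component count
k = int(np.searchsorted(cum_var, threshold) + 1)
print('components needed:', k)
print('variance kept by 2 components:', cum_var[1])
```

Two components, as used in the next cell, keep roughly 68% of the variance, so the reduced representation is convenient for a tiny example but discards a noticeable share of the information.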

In [11]:
pca=PCA(n_components=2)
pca.fit(X)
X_red=pca.transform(X)
X_red


array([[-0.60080536,  0.10209487],
       [-0.3485942 ,  0.48596018],
       [-0.59051593, -0.18790507],
       [ 0.62234168,  0.09687855],
       [ 0.1710981 ,  0.5393439 ],
       [ 0.68263317, -0.31748088],
       [ 0.59793697, -0.17669488],
       [-0.53409443, -0.54219665]])

#### Creation of training data and target values

In [12]:
# The first six rows of X_red are the rated songs; rows 6 and 7 (t1, t2) are held out
X_train=X_red[0:6,:]
y=np.array([0,1,0,1,0,0]) # ratings for sg1..sg6
print(X_train.shape)
print(y.shape)


(6, 2)
(6,)


### Building the logistic regression model

In [13]:
import statsmodels.api as sm
X_tr=sm.add_constant(X_train)
#print(X_tr.shape)
model=sm.Logit(y,X_tr).fit()
model.summary()


Optimization terminated successfully.
         Current function value: 0.514977
         Iterations 6


| | | | |
|---|---|---|---|
| Dep. Variable: | y | No. Observations: | 6 |
| Model: | Logit | Df Residuals: | 3 |
| Method: | MLE | Df Model: | 2 |
| Date: | Wed, 11 Sep 2024 | Pseudo R-squ.: | 0.1909 |
| Time: | 13:16:29 | Log-Likelihood: | -3.0899 |
| converged: | True | LL-Null: | -3.8191 |
| Covariance Type: | nonrobust | LLR p-value: | 0.4823 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.3470 | 1.304 | -1.033 | 0.302 | -3.903 | 1.209 |
| x1 | 1.4767 | 2.190 | 0.674 | 0.500 | -2.816 | 5.770 |
| x2 | 3.4893 | 3.518 | 0.992 | 0.321 | -3.407 | 10.385 |


#### Prediction for training data

In [14]:
X_tr


array([[ 1.        , -0.60080536,  0.10209487],
       [ 1.        , -0.3485942 ,  0.48596018],
       [ 1.        , -0.59051593, -0.18790507],
       [ 1.        ,  0.62234168,  0.09687855],
       [ 1.        ,  0.1710981 ,  0.5393439 ],
       [ 1.        ,  0.68263317, -0.31748088]])

In [15]:
print('Inline Predicted values are',np.round(model.predict(X_tr)))


Inline Predicted values are [0. 0. 0. 0. 1. 0.]
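
The two held-out documents (`t1`, `t2`) occupy the last two rows of `X_red` but are not scored above. As a sketch, the fitted model can be applied to them by hand with the logistic function. The coefficients and PCA coordinates below are copied from the rounded outputs shown earlier, so the probabilities are approximate; `beta`, `X_new`, `prob` and `pred` are illustrative names.

```python
import numpy as np

# const, x1, x2 as reported in the model summary (rounded)
beta = np.array([-1.3470, 1.4767, 3.4893])

# PCA coordinates of t1 ('classical') and t2 ('drums,beat') from X_red
X_new = np.array([[0.59793697, -0.17669488],
                  [-0.53409443, -0.54219665]])

# Prepend the intercept column, mirroring sm.add_constant
X_new_c = np.column_stack([np.ones(len(X_new)), X_new])

# Logistic function: p = 1 / (1 + exp(-x.beta))
prob = 1.0 / (1.0 + np.exp(-(X_new_c @ beta)))
pred = (prob >= 0.5).astype(int)

print('probabilities:', prob)
print('predicted labels:', pred)
```

Under these reported values both probabilities fall below 0.5, so both new documents would be labelled Dislike. In the notebook session itself, the equivalent call would be `model.predict(sm.add_constant(X_red[6:8,:], has_constant='add'))`.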


### Conclusion: The predicted outcomes for sg1 to sg6 are Dislike, Dislike, Dislike, Dislike, Like and Dislike
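
As a quick check, the inline predictions can be compared with the given ratings. This sketch copies both label vectors from the notebook and computes the share of songs on which they agree.

```python
import numpy as np

y_true = np.array([0, 1, 0, 1, 0, 0])  # ratings given in the problem
y_pred = np.array([0, 0, 0, 0, 1, 0])  # inline predictions printed above

# Fraction of songs where prediction and rating agree
accuracy = np.mean(y_true == y_pred)

# Songs rated Like (1) that the model also predicts as Like
liked_hits = int(np.sum((y_true == 1) & (y_pred == 1)))

print('in-sample accuracy:', accuracy)
print('liked songs recovered:', liked_hits)
```

The model agrees with the ratings on three of the six songs and recovers neither of the two liked songs, which is consistent with the large p-values in the summary for a sample this small.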