<a href="https://colab.research.google.com/github/Marcin19721205/MachineLearningBootCamp/blob/main/28_movie_reviews01MJ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### scikit-learn
Library website: [https://scikit-learn.org](https://scikit-learn.org)  

Documentation/User Guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)

The core machine learning library for Python.

To install scikit-learn, use the command below:
```
!pip install scikit-learn
```
To upgrade scikit-learn to the latest version, use the command below:
```
!pip install --upgrade scikit-learn
```
This course was created with version `0.22.1`; the outputs shown in this notebook come from a run on version `1.6.1`.

### Table of contents:
1. [Importing libraries](#0)
2. [Downloading the data](#1)
3. [Data exploration and preparation](#2)
4. [Training the model](#3)
5. [Model evaluation](#4)
6. [Prediction with the model](#5)



### <a name='0'></a> Importing libraries

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import sklearn

np.random.seed(42)
np.set_printoptions(precision=6, suppress=True, edgeitems=10, linewidth=1000, formatter=dict(float=lambda x: f'{x:.2f}'))
sklearn.__version__

'1.6.1'

### <a name='1'></a> Downloading the data

In [2]:
!wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/movie_reviews.zip

--2025-10-29 19:09:04--  https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/movie_reviews.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4004848 (3.8M) [application/zip]
Saving to: ‘movie_reviews.zip’


2025-10-29 19:09:05 (38.5 MB/s) - ‘movie_reviews.zip’ saved [4004848/4004848]



In [3]:
!unzip -q movie_reviews.zip

In [4]:
!pwd
!ls

/content
movie_reviews  movie_reviews.zip  sample_data


In [5]:
from sklearn.datasets import load_files

raw_movie = load_files('movie_reviews')
movie = raw_movie.copy()
movie.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])
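`load_files` expects one subfolder per class and uses the folder names as class names. The sketch below illustrates this on a tiny, hypothetical corpus written to a temporary directory; the file names and texts are made up for the example and are not part of the movie reviews data.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical mini-corpus with the same layout as movie_reviews/:
# one subfolder per class, one text file per document.
with tempfile.TemporaryDirectory() as root:
    for label, text in [("neg", "dull and slow"), ("pos", "sharp and funny")]:
        folder = Path(root) / label
        folder.mkdir()
        (folder / "review_0.txt").write_text(text, encoding="utf-8")

    # shuffle=False keeps the documents in folder order for this demo
    toy = load_files(root, shuffle=False)

# Folder names become class names; documents stay raw bytes
# because no `encoding` argument was passed.
print(toy["target_names"])
print(toy["target"])
print(toy["data"])
```

Passing `encoding='utf-8'` to `load_files` would return decoded strings instead of bytes; the notebook keeps the bytes and lets the vectorizer decode them later.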

### <a name='2'></a> Data exploration and preparation

In [8]:
movie['data'][:2]

[b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so called dark thriller , the devil ( gabriel byrne ) has come upon earth , to impregnate a woman ( robin tunney ) which happens every 1000 years , and basically destroy the world , but apparently god has chosen one man , and that one man is jericho cane ( arnold himself ) . \nwith the help of a trusty sidekick ( kevin pollack ) , they will stop at nothing to let the devil take over the world ! \nparts of this are actually so absurd , that they would fit right in with dogma . \nyes , the film is

In [9]:
movie['target'][:10]

array([0, 1, 1, 0, 1, 1, 1, 1, 1, 0])

In [10]:
movie['target_names']

['neg', 'pos']
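The integer targets index into `target_names`, so `0` means `neg` and `1` means `pos`. A quick way to count labels per class is `np.bincount`. The sketch below uses only the ten labels printed above as a stand-in; on the full dataset one would pass `movie['target']` instead.

```python
import numpy as np

# The first ten labels shown above, used here as a small stand-in array.
first_targets = np.array([0, 1, 1, 0, 1, 1, 1, 1, 1, 0])
target_names = ['neg', 'pos']

# bincount counts how often each integer label occurs.
counts = np.bincount(first_targets, minlength=len(target_names))
for name, count in zip(target_names, counts):
    print(f'{name}: {count}')
```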

In [12]:
movie['filenames'][:3]

array(['movie_reviews/neg/cv405_21868.txt', 'movie_reviews/pos/cv190_27052.txt', 'movie_reviews/pos/cv132_5618.txt'], dtype='<U33')

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(movie['data'], movie['target'], random_state=42)

print(f'X_train: {len(X_train)}')
print(f'X_test: {len(X_test)}')

X_train: 1500
X_test: 500
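The split above leaves `test_size` at its default of 0.25, which is why the reviews are divided into 1500 training and 500 test documents. The sketch below makes that explicit on placeholder data and additionally passes `stratify`, which the notebook does not use, to keep the class proportions equal in both parts. The document strings and labels are invented for illustration.

```python
from sklearn.model_selection import train_test_split

# Placeholder corpus: 2000 dummy documents with perfectly balanced labels.
documents = [f'doc {i}' for i in range(2000)]
labels = [i % 2 for i in range(2000)]

train_docs, test_docs, train_labels, test_labels = train_test_split(
    documents, labels,
    test_size=0.25,      # the default, written out explicitly
    stratify=labels,     # assumption: not used in the notebook's split
    random_state=42,
)

print(len(train_docs), len(test_docs))
print(sum(test_labels), len(test_labels) - sum(test_labels))
```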


In [14]:
X_train[0]

b'unzipped is a cinematic portrait of isaac mizrahi , an artist whose palette is fabric . \nostensibly , the film is a documentary , but use of that term requires stretching its meaning . \nmany scenes appear staged , and a great deal of cutting-and-pasting has been done in the editing room . \nthe cinema verite effect is a conceit -- genuine spontaneity is at a premium , and everyone is aware of and playing to the camera ( especially would-be actresses like cindy crawford ) . \ndirector douglas keeve ( who was mizrahi\'s lover at the time ) freely admits that he " couldn\'t care less about the truth " but was more interested in capturing " the spirit and love in isaac and in fashion . " \ndespite violating nearly every rule of " legitimate " documentary film making , however , unzipped is a remarkably enjoyable piece of entertainment . \nwhile it sheds only a little light on the behind-the-scenes world of the fashion industry , it presents a fascinating , if incomplete , picture of de

# TF-IDF matrix

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Revert X_train and X_test to original text data
X_train, X_test, y_train, y_test = train_test_split(movie['data'], movie['target'], random_state=42)

tfidf = TfidfVectorizer(max_features=3000)  # keep only the 3000 most frequent terms as TF-IDF features
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)

print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')

X_train shape: (1500, 3000)
X_test shape: (500, 3000)


In [18]:
X_train[0]

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 206 stored elements and shape (1, 3000)>

In [19]:
X_train[0].toarray()

array([[0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.06, 0.00, ..., 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00]])

### <a name='3'></a> Training the model

In [22]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

0.808
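As an alternative to calling the vectorizer and the classifier separately, both steps can be bundled into a single scikit-learn pipeline, so that raw text goes in and predictions come out. This is a sketch of that variant on invented training sentences, not the approach used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training sentences: 1 = positive, 0 = negative.
train_docs = ["a wonderful , moving film",
              "wonderful cast and a great story",
              "a dull and boring mess",
              "boring plot and awful acting"]
train_labels = [1, 1, 0, 0]

# fit() runs TF-IDF fitting and Naive Bayes training in one call.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

# predict() applies the fitted vectorizer before classifying.
predictions = model.predict(["a wonderful story", "an awful , boring mess"])
print(predictions)
```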

### <a name='4'></a> Model evaluation

In [23]:
from sklearn.metrics import confusion_matrix

y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
cm

array([[205,  35],
       [ 61, 199]])

In [24]:
import plotly.figure_factory as ff

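def plot_confusion_matrix(cm):
    # Reverse the row order: plotly draws the first row at the bottom,
    # so this puts the true 'negative' row at the top of the heatmap.
    cm = cm[::-1]
    cm = pd.DataFrame(cm, columns=['negative', 'positive'], index=['positive', 'negative'])

    # Rows are true classes, columns are predicted classes.
    fig = ff.create_annotated_heatmap(z=cm.values, x=list(cm.columns), y=list(cm.index),
                                      colorscale='ice', showscale=True, reversescale=True)
    fig.update_layout(width=400, height=400, title='Confusion Matrix', font_size=16)
    fig.show()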

plot_confusion_matrix(cm)

In [25]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=['negative', 'positive']))

              precision    recall  f1-score   support

    negative       0.77      0.85      0.81       240
    positive       0.85      0.77      0.81       260

    accuracy                           0.81       500
   macro avg       0.81      0.81      0.81       500
weighted avg       0.81      0.81      0.81       500
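The figures in the report follow directly from the confusion matrix computed earlier. The sketch below recomputes them with NumPy from the four counts printed above; the variable names are chosen for this example only.

```python
import numpy as np

# Confusion matrix from above: rows are true classes, columns are predicted classes.
cm = np.array([[205, 35],
               [61, 199]])

correct = np.diag(cm)                 # correctly classified reviews per class
precision = correct / cm.sum(axis=0)  # divide by column sums (all predicted as that class)
recall = correct / cm.sum(axis=1)     # divide by row sums (all truly in that class)
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

for name, p, r, f in zip(['negative', 'positive'], precision, recall, f1):
    print(f'{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}')
print(f'accuracy={accuracy:.3f}')
```

For example, precision for the negative class is 205 / (205 + 61) ≈ 0.77, and accuracy is (205 + 199) / 500 = 0.808, matching the score returned by `classifier.score`.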



### <a name='5'></a> Prediction with the model

In [30]:
new_reviews = ['It was awesome! Very interesting story.',
               'I cannot recommend this film. Short and awful.',
               'Very long and boring. Don\'t waste your time.',
               'Well-organized and quite interesting.']

new_reviews_tfidf = tfidf.transform(new_reviews)
new_reviews_tfidf

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 24 stored elements and shape (4, 3000)>

In [31]:
new_reviews_tfidf.toarray()

array([[0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.00, 0.00, 0.00, 0.34, 0.00, 0.00, 0.00, 0.00],
       [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00]])

In [32]:
new_reviews_pred = classifier.predict(new_reviews_tfidf)
new_reviews_pred

array([1, 0, 0, 1])

In [33]:
new_reviews_prob = classifier.predict_proba(new_reviews_tfidf)
new_reviews_prob

array([[0.48, 0.52],
       [0.63, 0.37],
       [0.77, 0.23],
       [0.44, 0.56]])

In [34]:
np.argmax(new_reviews_prob, axis=1)

array([1, 0, 0, 1])

In [35]:
movie['target_names']

['neg', 'pos']

In [36]:
for review, target, prob in zip(new_reviews, new_reviews_pred, new_reviews_prob):
    print(f"{review} -> {movie['target_names'][target]} -> {prob[target]:.4f}")

It was awesome! Very interesting story. -> pos -> 0.5234
I cannot recommend this film. Short and awful. -> neg -> 0.6344
Very long and boring. Don't waste your time. -> neg -> 0.7726
Well-organized and quite interesting. -> pos -> 0.5635
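Two of the four predictions sit only slightly above 0.5, so the model is far from certain about them. One simple way to surface this is to flag predictions whose probability falls below a chosen cut-off. The sketch below reuses the rounded probabilities printed earlier; the 0.6 threshold is an arbitrary value picked for illustration.

```python
import numpy as np

# Rounded class probabilities from above; columns follow ['neg', 'pos'].
probabilities = np.array([[0.48, 0.52],
                          [0.63, 0.37],
                          [0.77, 0.23],
                          [0.44, 0.56]])
target_names = ['neg', 'pos']
threshold = 0.6  # hypothetical cut-off, not taken from the notebook

predicted = np.argmax(probabilities, axis=1)  # index of the most likely class
confidence = probabilities.max(axis=1)        # probability of that class
uncertain = confidence < threshold

for label, conf, flag in zip(predicted, confidence, uncertain):
    note = ' (low confidence)' if flag else ''
    print(f'{target_names[label]} -> {conf:.2f}{note}')
```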
