Implement a basic logistic regression model that classifies text messages as spam or not spam (the data used are available here).


What to do:
Convert all text to lowercase;
Remove junk characters;
Remove stop words;
Reduce every word to its normal form;
Convert all messages to TF-IDF vectors;
Split the data into training and test sets;
Build a logistic regression model and evaluate its accuracy on the test data;
Describe the results using confusion_matrix;
Build a dataframe containing all original message texts that were classified incorrectly (showing the actual and the predicted class).
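
The first three preprocessing steps from the list above can be sketched on a single string as follows. This is only an illustration: `clean_message` and the tiny `STOP_WORDS` set are hypothetical names introduced here, and lemmatization is left out because it needs the NLTK data.

```python
import re

# Hypothetical, tiny stop-word list used only for this illustration;
# a real run would take the full English list from NLTK.
STOP_WORDS = {"until", "only", "in", "the", "a"}


def clean_message(text: str) -> str:
    """Lowercase one message, strip punctuation and drop stop words."""
    text = text.lower()
    # Replace every run of characters that are neither word characters nor whitespace
    text = re.sub(r"[^\w\s]+", " ", text)
    # split() also collapses the repeated spaces left by the substitution
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)


print(clean_message("Go until Jurong point, crazy.. Available only in the city!"))
# prints: go jurong point crazy available city
```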

In [None]:
import glob
import pandas as pd
import re

In [None]:
df=pd.read_csv('spam.csv')
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
# Convert the text to lowercase
df.Message = df.Message.str.lower()
df.head()

Unnamed: 0,Category,Message
0,ham,"go until jurong point, crazy.. available only ..."
1,ham,ok lar... joking wif u oni...
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor... u c already then say...
4,ham,"nah i don't think he goes to usf, he lives aro..."


In [None]:
# Remove unwanted characters (everything that is not a word character or whitespace)
df.Message = df.Message.str.replace(r'[^\w\s]+', ' ', regex=True)



In [None]:
df.head()

Unnamed: 0,Category,Message
0,ham,go until jurong point crazy available only i...
1,ham,ok lar joking wif u oni
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor u c already then say
4,ham,nah i don t think he goes to usf he lives aro...


In [None]:
# Remove stop words

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Drop the stop words
df['Message'] = df['Message'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop]))

In [None]:
df.head()

Unnamed: 0,Category,Message
0,ham,go jurong point crazy available bugis n great ...
1,ham,ok lar joking wif u oni
2,spam,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,u dun say early hor u c already say
4,ham,nah think goes usf lives around though


In [None]:
# Reduce every word to its normal form (lemmatization)
from nltk.stem import WordNetLemmatizer
nltk.download('omw-1.4')
nltk.download('wordnet')

wordnet_lemmatizer = WordNetLemmatizer()
# Lemmatize word by word: iterating over the string itself would yield single characters
df['Message'] = df['Message'].apply(
    lambda text: ' '.join(wordnet_lemmatizer.lemmatize(word) for word in text.split())
)
df.head()

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,Category,Message
0,ham,go jurong point crazy available bugis n great ...
1,ham,ok lar joking wif u oni
2,spam,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,u dun say early hor u c already say
4,ham,nah think goes usf lives around though


In [None]:
# Convert the words to TF-IDF vectors
'''
The TF-IDF matrix scores how important each word is in a
text document, based on the word's frequency in that document
and the inverse of its frequency across the whole collection of documents.
'''
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df.Message)
names = tfidf.get_feature_names_out()
tfidf_matrix = pd.DataFrame(tfidf_matrix.toarray(), columns=names)

In [None]:
tfidf_matrix

Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,02,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,〨ud
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5568,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5569,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5570,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
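
To make the weighting concrete, here is a minimal hand computation of TF-IDF for three toy messages. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults; the messages and the helper `tfidf_vector` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical messages, already cleaned)
docs = ["free entry win", "free call", "ok call later"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many messages each term occurs
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_vector(tokens):
    """Return the L2-normalized TF-IDF weights of one tokenized message."""
    counts = Counter(tokens)
    # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.3f}")
```

In the first message, "free" also occurs in another message, so it gets a lower weight than "entry" and "win", which occur only here.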


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [None]:
X = tfidf_matrix
y = df.Category

In [None]:
# Split the data into training and test sets
# 30% of the sample goes to the test set, 70% to the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# Build the logistic regression model and evaluate its accuracy on the test data
# choose the model
log_regression = LogisticRegression()

# train the model
log_regression.fit(X_train, y_train)

# apply the model to the test data
y_pred = log_regression.predict(X_test)

In [None]:
# Describe the results using confusion_matrix
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

array([[1445,    3],
       [  69,  155]])

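'''
Rows are the actual classes and columns the predicted ones, in the order (ham, spam).
Taking spam as the positive class:

1.1 True negatives (ham predicted as ham): 1445
1.2 False positives (ham predicted as spam): 3
2.1 False negatives (spam predicted as ham): 69
2.2 True positives (spam predicted as spam): 155
'''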
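
The four counts in the matrix above are enough to derive the usual summary metrics by hand. The sketch below assumes that spam is treated as the positive class and that the rows and columns follow the sorted label order (ham, spam).

```python
# Counts taken from the confusion matrix above
# (rows: actual class, columns: predicted class, order: ham, spam)
tn, fp = 1445, 3   # actual ham:  predicted ham / predicted spam
fn, tp = 69, 155   # actual spam: predicted ham / predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: accuracy=0.957 precision=0.981 recall=0.692 f1=0.812
```

The high accuracy mostly reflects the large ham class: almost a third of the spam messages in the test set (69 of 224) still get through.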

In [None]:
print(" Accuracy:", metrics.accuracy_score(y_test, y_pred))  # share of correctly classified test messages


 Accuracy: 0.9569377990430622


Build a dataframe containing all original message texts that were classified incorrectly (showing the actual and the predicted class).

In [None]:
y_test.describe()

count     1672
unique       2
top        ham
freq      1448
Name: Category, dtype: object

In [None]:
y_test

3245     ham
944      ham
1044     ham
2484     ham
812      ham
        ... 
2505     ham
2525    spam
4975     ham
650     spam
4463     ham
Name: Category, Length: 1672, dtype: object

In [None]:
y_test_df = pd.DataFrame(y_test)

In [None]:
y_test_df

Unnamed: 0,Category
3245,ham
944,ham
1044,ham
2484,ham
812,ham
...,...
2505,ham
2525,spam
4975,ham
650,spam


In [None]:
y_pred_df = pd.DataFrame(y_pred)

In [None]:
y_pred_df

Unnamed: 0,0
0,ham
1,ham
2,ham
3,ham
4,ham
...,...
1667,ham
1668,spam
1669,ham
1670,spam


In [None]:
y_pred_df.rename(columns = {0:'Category'}, inplace = True)

In [None]:
y_pred_df

Unnamed: 0,Category
0,ham
1,ham
2,ham
3,ham
4,ham
...,...
1667,ham
1668,spam
1669,ham
1670,spam


In [None]:
# y_test_df is the test sample; it keeps the index labels of the rows that were selected for it.
# To recover the message text, join the initial dataframe with y_test_df on the index.
# The result is a dataframe with the test sample and its texts.
out = pd.merge(y_test_df, df, left_index=True, right_index=True)


In [None]:
out

Unnamed: 0,Category_x,Category_y,Message
3245,ham,ham,squeeeeeze christmas hug u lik frndshp den hug...
944,ham,ham,also sorta blown couple times recently id rath...
1044,ham,ham,mmm thats better got roast b better drinks 2 g...
2484,ham,ham,mm kanji dont eat anything heavy ok
812,ham,ham,ring comes guys costumes gift future yowifes h...
...,...,...,...
2505,ham,ham,hello boytoy made home constant thought love h...
2525,spam,spam,free entry 250 weekly comp send word win 80086...
4975,ham,ham,aiyo u poor thing u dun wan 2 eat u bathe already
650,spam,spam,1 000 cash 2 000 prize claim call09050000327 c...


In [None]:
# y_pred_df has an ascending 0..n-1 index. To compare the test sample with the predictions row by row, reset the index.
out.reset_index(inplace = True)

In [None]:
out

Unnamed: 0,index,Category_x,Category_y,Message
0,3245,ham,ham,squeeeeeze christmas hug u lik frndshp den hug...
1,944,ham,ham,also sorta blown couple times recently id rath...
2,1044,ham,ham,mmm thats better got roast b better drinks 2 g...
3,2484,ham,ham,mm kanji dont eat anything heavy ok
4,812,ham,ham,ring comes guys costumes gift future yowifes h...
...,...,...,...,...
1667,2505,ham,ham,hello boytoy made home constant thought love h...
1668,2525,spam,spam,free entry 250 weekly comp send word win 80086...
1669,4975,ham,ham,aiyo u poor thing u dun wan 2 eat u bathe already
1670,650,spam,spam,1 000 cash 2 000 prize claim call09050000327 c...


In [None]:
# Keep only the rows where the actual class differs from the prediction in y_pred_df.
out = out[out.Category_x != y_pred_df.Category]

In [None]:
# Clean up the dataframe
out = out.drop(['index','Category_x'], axis =1)

In [None]:
out

Unnamed: 0,Category_y,Message
17,ham,hey free call
40,spam,reminder downloaded content already paid goto ...
47,spam,guess somebody know secretly fancies wanna fin...
74,spam,oh god found number glad text back xafter msgs...
84,spam,next amazing xxx picsfree1 video sent enjoy on...
...,...,...
1525,spam,freemsg hi baby wow got new cam moby wanna c h...
1567,spam,important customer service announcement premier
1569,spam,themob check newest selection content games to...
1576,ham,free call


In [None]:
out.rename(columns = {'Category_y':'Category'}, inplace = True)


In [None]:
out

Unnamed: 0,Category,Message
17,ham,hey free call
40,spam,reminder downloaded content already paid goto ...
47,spam,guess somebody know secretly fancies wanna fin...
74,spam,oh god found number glad text back xafter msgs...
84,spam,next amazing xxx picsfree1 video sent enjoy on...
...,...,...
1525,spam,freemsg hi baby wow got new cam moby wanna c h...
1567,spam,important customer service announcement premier
1569,spam,themob check newest selection content games to...
1576,ham,free call
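
A more compact way to get the same table, with the original (uncleaned) text and both the actual and the predicted class, is to reuse the index of the test sample. The sketch below runs on toy data: `raw`, the messages and the predictions are made up, and it assumes a copy of the unprocessed messages was kept before cleaning.

```python
import pandas as pd

# Toy stand-ins for the notebook's objects (all values are made up for illustration)
raw = pd.DataFrame({
    "Category": ["ham", "spam", "ham", "spam", "ham"],
    "Message": ["Ok lar...", "FREE entry!!", "Hey, free to call?",
                "Reminder: you have paid", "See you"],
})
y_test = raw.loc[[3, 0, 2], "Category"]   # keeps the original row labels
y_pred = ["ham", "ham", "spam"]           # what predict() would return, in the same order

# Look up the raw text by the test index, then attach both label columns:
# the Series aligns on the index, the list is assigned positionally
errors = raw.loc[y_test.index, ["Message"]].assign(Actual=y_test, Predicted=y_pred)
errors = errors[errors["Actual"] != errors["Predicted"]]
print(errors)
```

Because the predictions come back in the same order as the rows of the test sample, no merge or index reset is needed.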
