## England NHS GP Reviews Sentiment Analysis
source: https://huggingface.co/datasets/janduplessis886/england-nhs-gp-reviews/viewer/default/train

In [1]:
import pandas as pd

df = pd.read_csv('england-nhs-gp-reviews.csv')
df.head()

Unnamed: 0,ode,surgeryname,url,title,star_rating,comment,visited_date
0,E81050,asplands-medical-centre,https://www.nhs.uk/services/gp-surgery/aspland...,What's changed?,3,Have been with this practice for a number of y...,August 2022
1,E81050,asplands-medical-centre,https://www.nhs.uk/services/gp-surgery/aspland...,Woburn surgery,5,I have been a patient at this practice for man...,July 2022
2,E81050,asplands-medical-centre,https://www.nhs.uk/services/gp-surgery/aspland...,Don't waste your time GPs never available,1,"Visited my gp, over resistant hypertension. Gr...",June 2022
3,E81050,asplands-medical-centre,https://www.nhs.uk/services/gp-surgery/aspland...,Great practice,5,I contacted the surgery by telephone for a non...,June 2022
4,K82064,fishermead-medical-centre,https://www.nhs.uk/services/gp-surgery/fisherm...,Welcoming and supportive,5,I have great respect for the staff at Fisherme...,July 2023


In [2]:
print("Number of rows: ", df.shape[0],"\nNumber of Columns: ", df.shape[1])

Number of rows:  61955 
Number of Columns:  7


#### Check for missing data

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61955 entries, 0 to 61954
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   ode           61955 non-null  object
 1   surgeryname   61955 non-null  object
 2   url           61955 non-null  object
 3   title         61955 non-null  object
 4   star_rating   61955 non-null  int64 
 5   comment       61955 non-null  object
 6   visited_date  61955 non-null  object
dtypes: int64(1), object(6)
memory usage: 3.3+ MB


No missing data: all 61955 rows are non-null in every column.
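A non-null count does not catch text fields that are present but empty. The sketch below uses a small made-up DataFrame (`toy` and its contents are illustrative, not taken from the dataset) to show how null values and blank strings can be counted separately.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the reviews dataset.
toy = pd.DataFrame({
    "title": ["Great practice", "", "Rude staff"],
    "comment": ["Very helpful staff.", "   ", None],
})

# Count true missing values (None / NaN) per column.
null_counts = toy.isna().sum()

# Count values that are present but empty or whitespace-only.
blank_counts = toy.apply(lambda col: col.str.strip().eq("").sum())

print("Nulls per column:\n", null_counts)
print("Blank strings per column:\n", blank_counts)
```

The same two checks could be run on `df[['title', 'comment']]` before preprocessing.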

### Map ratings to three-class sentiment labels (Positive: 1, Neutral: 0, Negative: -1)

In [4]:
df['sentiment'] = df['star_rating'].apply(lambda x: 1 if x > 3 else 0 if x == 3 else -1)
df.head(10)

Unnamed: 0,ode,surgeryname,url,title,star_rating,comment,visited_date,sentiment
0,E81050,asplands-medical-centre,https://www.nhs.uk/services/gp-surgery/aspland...,What's changed?,3,Have been with this practice for a number of y...,August 2022,0
1,E81050,asplands-medical-centre,https://www.nhs.uk/services/gp-surgery/aspland...,Woburn surgery,5,I have been a patient at this practice for man...,July 2022,1
2,E81050,asplands-medical-centre,https://www.nhs.uk/services/gp-surgery/aspland...,Don't waste your time GPs never available,1,"Visited my gp, over resistant hypertension. Gr...",June 2022,-1
3,E81050,asplands-medical-centre,https://www.nhs.uk/services/gp-surgery/aspland...,Great practice,5,I contacted the surgery by telephone for a non...,June 2022,1
4,K82064,fishermead-medical-centre,https://www.nhs.uk/services/gp-surgery/fisherm...,Welcoming and supportive,5,I have great respect for the staff at Fisherme...,July 2023,1
5,K82064,fishermead-medical-centre,https://www.nhs.uk/services/gp-surgery/fisherm...,Selective service,1,when I tried to register to go on doctors list...,October 2022,-1
6,K82615,walnut-tree-health-centre,https://www.nhs.uk/services/gp-surgery/walnut-...,Rude staff and no GP appointments ever available,1,Staff are rude and talk over you. Cant get an ...,July 2023,-1
7,K82615,walnut-tree-health-centre,https://www.nhs.uk/services/gp-surgery/walnut-...,Thank you,3,I have been in contact with two members of sta...,July 2023,0
8,K82615,walnut-tree-health-centre,https://www.nhs.uk/services/gp-surgery/walnut-...,unprofessional staff,1,I have been ringing for 2 days trying to sort ...,June 2023,-1
9,K82615,walnut-tree-health-centre,https://www.nhs.uk/services/gp-surgery/walnut-...,This practice is failing,1,Queued outside practice from 7:30 to be told a...,April 2023,-1
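The nested conditional in the lambda can also be written as a named function, which makes the thresholds easier to read and test. This is a minimal sketch; `rating_to_sentiment` is a hypothetical name that does not appear in the notebook. Any rating below 3, including 0, falls into the negative class, exactly as in the lambda.

```python
def rating_to_sentiment(rating: int) -> int:
    """Map a star rating to 1 (positive), 0 (neutral) or -1 (negative)."""
    if rating > 3:
        return 1
    if rating == 3:
        return 0
    return -1


# Show the label assigned to every rating from 0 to 5.
mapped = {rating: rating_to_sentiment(rating) for rating in range(6)}
print(mapped)
```

With the DataFrame from this notebook it would be applied as `df['star_rating'].apply(rating_to_sentiment)`.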


### Features and Labels

In [5]:
features = df[['title', 'comment']]
labels = df['sentiment']

In [6]:
features.head()

Unnamed: 0,title,comment
0,What's changed?,Have been with this practice for a number of y...
1,Woburn surgery,I have been a patient at this practice for man...
2,Don't waste your time GPs never available,"Visited my gp, over resistant hypertension. Gr..."
3,Great practice,I contacted the surgery by telephone for a non...
4,Welcoming and supportive,I have great respect for the staff at Fisherme...


In [7]:
labels.head()

0    0
1    1
2   -1
3    1
4    1
Name: sentiment, dtype: int64

### WordNetLemmatizer function

In [8]:
import string

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')
# Note: word_tokenize also needs the 'punkt' tokenizer data
# ('punkt_tab' on newer NLTK releases), assumed to be installed already.

# Build these once instead of on every call
wnl = WordNetLemmatizer()
stopWords = set(stopwords.words('english'))

def lemmatizer(text):
    """Lowercase, drop stop words, lemmatize and remove punctuation tokens."""
    # Tokenize
    tokens = word_tokenize(text)
    # Lemmatize the lowercased tokens that are not stop words
    lemmaToken = [wnl.lemmatize(token.lower()) for token in tokens if token.lower() not in stopWords]
    # Remove punctuation tokens
    lemmaToken = [token for token in lemmaToken if token not in string.punctuation]

    return ' '.join(lemmaToken)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kianm\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kianm\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


#### Lemmatize features

In [9]:
# assign() builds a new DataFrame, which avoids the SettingWithCopyWarning
# raised when writing into `features`, a slice of `df`
features = features.assign(title=features['title'].apply(lemmatizer),
                           comment=features['comment'].apply(lemmatizer))


In [17]:
features.head()

Unnamed: 0,title,comment
0,'s changed,practice number year always found excellent ho...
1,woburn surgery,patient practice many year always found helpfu...
2,n't waste time gps never available,visited gp resistant hypertension great appoin...
3,great practice,contacted surgery telephone non urgent appoint...
4,welcoming supportive,great respect staff fishermead medical centre ...
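Fragments such as `'s` and `n't` survive in the output above because the punctuation filter is a substring test against `string.punctuation`, so only tokens made up of characters that appear consecutively in that string are dropped. The sketch below illustrates this with hand-written tokens that imitate tokenizer output (they are not real NLTK output) and compares the notebook's filter with a hypothetical stricter one based on `str.isalpha()`.

```python
import string

# Hand-written tokens imitating tokenizer output; illustrative only.
tokens = ["what", "'s", "changed", "?", "do", "n't", "waste", "(", "really", ")", "..."]

# Filter used in the notebook: substring test against the punctuation string.
kept_substring = [t for t in tokens if t not in string.punctuation]

# Hypothetical stricter filter: keep purely alphabetic tokens only.
kept_alpha = [t for t in tokens if t.isalpha()]

print(kept_substring)
print(kept_alpha)
```

The stricter filter is a trade-off: it also discards numbers and hyphenated terms, which may or may not be wanted for review text.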


### Train/test split ('comment' column only)

In [19]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(features['comment'], labels, test_size=0.2, random_state=42)

### Training the Model using TF-IDF and logistic regression

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.pipeline import make_pipeline

# TF-IDF features fed into a logistic regression classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

#Train the model
model.fit(x_train, y_train)

#Predict with model on test data
y_pred = model.predict(x_test)

#Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)

#Confusion Matrix
confMatrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", confMatrix)


Accuracy:  0.9189734484706642
Confusion Matrix:
 [[4526   17  213]
 [ 399   94  163]
 [ 204    8 6767]]


The model scores poorly on class 0 (neutral, rating 3): only 94 of the 656 neutral test reviews (about 14%) are classified correctly. The likely cause is that there are too few neutral reviews to train on.
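Per-class precision and recall can be read directly off the confusion matrix. The sketch below copies the matrix printed above and assumes scikit-learn's default layout: rows are true labels, columns are predicted labels, both in sorted order (-1, 0, 1). The variable names are illustrative.

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels, in the order -1, 0, 1.
labels_order = [-1, 0, 1]
conf_matrix = [
    [4526, 17, 213],
    [399, 94, 163],
    [204, 8, 6767],
]

recall = {}
precision = {}
for i, label in enumerate(labels_order):
    true_total = sum(conf_matrix[i])                 # reviews that really have this label
    pred_total = sum(row[i] for row in conf_matrix)  # reviews predicted as this label
    recall[label] = conf_matrix[i][i] / true_total
    precision[label] = conf_matrix[i][i] / pred_total

for label in labels_order:
    print(f"class {label:>2}: precision={precision[label]:.3f} recall={recall[label]:.3f}")
```

The same figures are available from `sklearn.metrics.classification_report(y_test, y_pred)` without the manual arithmetic.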

In [23]:
df['star_rating'].value_counts()

star_rating
5    29495
1    18054
4     5253
2     5192
3     3286
0      675
Name: count, dtype: int64

Positive value counts (4, 5): 34748 (56.1% of data)
Negative value counts (0, 1, 2): 23921 (38.6% of data)
Neutral value counts (3): 3286 (5.3% of data)

Rating 3 accounts for only 5.3% of the data.
Drop the rows with a rating of 3, as there are too few of them to train a meaningful neutral class.
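The class shares quoted above follow from the `value_counts()` output. As a quick check, the sketch below groups the counts by sentiment class and recomputes the percentages; the dictionary simply restates the printed counts, and the names are illustrative.

```python
# Counts copied from the value_counts() output above.
rating_counts = {5: 29495, 1: 18054, 4: 5253, 2: 5192, 3: 3286, 0: 675}

# Group ratings into sentiment classes using the same thresholds as the notebook.
class_totals = {"positive": 0, "neutral": 0, "negative": 0}
for rating, count in rating_counts.items():
    if rating > 3:
        class_totals["positive"] += count
    elif rating == 3:
        class_totals["neutral"] += count
    else:
        class_totals["negative"] += count

total = sum(rating_counts.values())
class_share = {name: round(100 * n / total, 1) for name, n in class_totals.items()}

print(class_totals)
print(class_share)
```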

### Drop rows with a rating of 3

In [27]:
df_drop0 = df[df['sentiment']!=0]
df_drop0['sentiment'].value_counts()

sentiment
 1    34748
-1    23921
Name: count, dtype: int64

### Features and labels without sentiment 0

In [29]:
featuresNoZero = df_drop0['comment']
labelsNoZero = df_drop0['sentiment']

In [30]:
featuresNoZero.head()

1    I have been a patient at this practice for man...
2    Visited my gp, over resistant hypertension. Gr...
3    I contacted the surgery by telephone for a non...
4    I have great respect for the staff at Fisherme...
5    when I tried to register to go on doctors list...
Name: comment, dtype: object

In [31]:
labelsNoZero.head()

1    1
2   -1
3    1
4    1
5   -1
Name: sentiment, dtype: int64

#### Lemmatize featuresNoZero

In [32]:
featuresNoZero = featuresNoZero.apply(lemmatizer)

In [33]:
featuresNoZero.head()

1    patient practice many year always found helpfu...
2    visited gp resistant hypertension great appoin...
3    contacted surgery telephone non urgent appoint...
4    great respect staff fishermead medical centre ...
5    tried register go doctor list told catchment a...
Name: comment, dtype: object

### Train/test split of features and labels (no zero)

In [34]:
x_trainN, x_testN, y_trainN, y_testN = train_test_split(featuresNoZero, labelsNoZero, test_size=0.2, random_state=42)

### Train the model

In [35]:
# TF-IDF features fed into a logistic regression classifier
modelN = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

#Train the model
modelN.fit(x_trainN, y_trainN)

#Predict with model on test data
y_predN = modelN.predict(x_testN)

#Evaluate model performance
accuracyN = accuracy_score(y_testN, y_predN)
print("Accuracy: ", accuracyN)

#Confusion Matrix
confMatrixN = confusion_matrix(y_testN, y_predN)
print("Confusion Matrix:\n", confMatrixN)


Accuracy:  0.9590080109084711
Confusion Matrix:
 [[4473  237]
 [ 244 6780]]
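To put the binary accuracy in context, it can be compared with a baseline that always predicts the larger class. The sketch below works from the confusion matrix printed above, again assuming rows are true labels in the order -1, 1; the variable names are illustrative.

```python
# Binary confusion matrix copied from the output above.
# Rows are true labels (-1, then 1); columns are predicted labels.
conf_matrix = [
    [4473, 237],
    [244, 6780],
]

total = sum(sum(row) for row in conf_matrix)
correct = conf_matrix[0][0] + conf_matrix[1][1]
accuracy = correct / total

# Accuracy of a model that always predicts the most frequent true class.
majority_baseline = max(sum(row) for row in conf_matrix) / total

print(f"accuracy={accuracy:.4f} majority_baseline={majority_baseline:.4f}")
```

On this test split the baseline is roughly 0.60, so the classifier's 0.96 reflects real signal rather than class imbalance alone.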


### Export the model

In [36]:
import pickle

# Export is disabled by default; uncomment to write the fitted pipeline to disk.
# with open('commentSentiment.pkl', 'wb') as file:
#     pickle.dump(modelN, file)

### Load the model

In [None]:
# with open('commentSentiment.pkl', 'rb') as file:  # closes the file handle after loading
#     loaded_model = pickle.load(file)
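A pickled pipeline carries both the fitted vectorizer and the classifier, so the restored object can predict on raw strings directly. The sketch below demonstrates the round trip in memory with a tiny made-up training set; `toy_texts`, `toy_labels` and `toy_model` are illustrative stand-ins for the notebook's data and `modelN`, and no file is written.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up, already-preprocessed examples; not taken from the dataset.
toy_texts = [
    "great helpful staff",
    "friendly doctor excellent care",
    "rude staff never available",
    "waste time no appointment",
]
toy_labels = [1, 1, -1, -1]

toy_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
toy_model.fit(toy_texts, toy_labels)

# Serialize to bytes and restore, mirroring pickle.dump / pickle.load on a file.
blob = pickle.dumps(toy_model)
restored = pickle.loads(blob)

new_comments = ["helpful friendly doctor", "rude never available"]
original_pred = list(toy_model.predict(new_comments))
restored_pred = list(restored.predict(new_comments))
print(original_pred, restored_pred)
```

When using the real model, new comments should go through the same `lemmatizer` function before `predict`, since the pipeline was trained on lemmatized text. Pickle files should only be loaded from trusted sources.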