
**TASK 2**


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**PATHS**

In [2]:
train_path = r'/content/drive/My Drive/news-topic-classification-master/BBC News Train.csv'
test_path = r'/content/drive/My Drive/news-topic-classification-master/BBC News Test.csv'
solution_path = r'/content/drive/My Drive/news-topic-classification-master/BBC News Sample Solution.csv'

**Import Libraries**

In [3]:
import nltk
import pandas as pd
import re
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
import pickle

stop = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


**Import dataset**

In [4]:
train_df = pd.read_csv(train_path, sep=',')

In [5]:
train_df.iloc[0]['Text']



In [6]:
print('There is no empty row under Text column:', train_df['Text'].notna().all())
print('There is no empty row under Category column:', train_df['Category'].notna().all())

There is no empty row under Text column: True
There is no empty row under Category column: True
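
Python's built-in `all()` tests truthiness, and pandas stores missing values as `NaN`, which is truthy, so a more explicit check can be useful. The following is a minimal sketch on a made-up frame; `count_missing_or_blank` and `demo` are hypothetical names, not part of the notebook.

```python
import pandas as pd

def count_missing_or_blank(series):
    """Count entries that are missing (NaN/None) or contain only whitespace."""
    # Missing values become '' so that one comparison covers both cases.
    return int(series.fillna('').astype(str).str.strip().eq('').sum())

# Made-up data: one missing value and one whitespace-only row.
demo = pd.DataFrame({'Text': ['some article', float('nan'), '   ', 'another article']})

print(all(demo['Text']))                     # truthiness check, does not flag NaN
print(count_missing_or_blank(demo['Text']))  # explicit count of problem rows
```

A count of zero from this helper is a stronger statement than a `True` from `all()`, since it also rules out whitespace-only rows.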


**PREPROCESS FUNCTIONS**

In [7]:
# preprocess cleans the text with regexes, removes stopwords, then applies stemming and lemmatization.

def preprocess(df):

    def preprocess_text(text):
        # Keep only word characters and whitespace, then drop digits.
        s = re.sub(r'[^\s\w]', '', text)
        s = re.sub(r'\n|\t', ' ', s)
        s = re.sub(r'[0-9]+', '', s)
        return s

    def remove_stopword(text):
        # Lowercase before the lookup: the NLTK stopword list is lowercase.
        return [word for word in text.lower().split() if word not in stop]

    port = PorterStemmer()
    def stemming(text):
        return [port.stem(word) for word in text]

    lem = WordNetLemmatizer()
    def lemmatize(text):
        lem_text = [lem.lemmatize(word) for word in text]
        return ' '.join(lem_text)

    # Note: this overwrites the 'Text' column of the DataFrame passed in.
    df['Text'] = df['Text'].apply(preprocess_text)
    df['Text'] = df['Text'].apply(remove_stopword)
    df['Text'] = df['Text'].apply(stemming)
    df['Text'] = df['Text'].apply(lemmatize)

    # ----example----
    # Raw: This sentence makes no sense. Hi i am writing a\n/ <project> %% on !!topic. A good fox jumps over the wall!!!!
    # After regex:  This sentence makes no sense Hi i am writing a  project  on topic A good fox jumps over the wall
    # After stopword removal:  ['sentence', 'makes', 'sense', 'hi', 'writing', 'project', 'topic', 'good', 'fox', 'jumps', 'wall']
    # After stemming:  ['sentenc', 'make', 'sens', 'hi', 'write', 'project', 'topic', 'good', 'fox', 'jump', 'wall']
    # After lemmatizing:  sentenc make sen hi write project topic good fox jump wall
    # ---------------
    return df['Text']
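
The order of lowercasing and stopword lookup affects which tokens survive, because NLTK's English stopword list is lowercase. The sketch below illustrates this with the standard library only; `DEMO_STOPWORDS` is a tiny stand-in for the real list, and the function names are hypothetical.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption, for illustration only).
DEMO_STOPWORDS = {'this', 'no', 'i', 'am', 'a', 'on', 'over', 'the'}

def clean(text):
    """Drop punctuation and digits, keeping word characters and whitespace."""
    text = re.sub(r'[^\s\w]', '', text)
    return re.sub(r'[0-9]+', '', text)

def tokens_check_then_lower(text):
    # Membership is tested on the original casing, so 'This' and 'A' slip through.
    return [w.lower() for w in clean(text).split() if w not in DEMO_STOPWORDS]

def tokens_lower_then_check(text):
    # Lowercasing first makes the membership test case-insensitive.
    return [w for w in clean(text).lower().split() if w not in DEMO_STOPWORDS]

raw = 'This sentence makes no sense. A good fox jumps over the wall!!!!'
print(tokens_check_then_lower(raw))
print(tokens_lower_then_check(raw))
```

On text that is already lowercase the two variants agree; they differ only when capitalised stopwords such as sentence-initial "This" appear.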

  
  

**VECTORIZE FUNCTION**

In [8]:
def vectorize(df):
    # Fit TF-IDF on the given texts and save the fitted vectorizer for reuse at test time.
    tfid = TfidfVectorizer(smooth_idf=True, use_idf=True)
    X = tfid.fit_transform(df)
    with open('saved_tfidf.sav', 'wb') as saved_tfidf:
        pickle.dump(tfid, saved_tfidf)
    return X
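
The reason for saving the fitted vectorizer is that new text must be mapped onto the vocabulary learned from the training set. A minimal sketch of that fit/transform split follows, using made-up documents and an in-memory pickle round trip in place of the `.sav` file; all variable names here are illustrative.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents standing in for the preprocessed articles.
train_docs = ['stock market rise', 'football match win', 'market share fall']
new_docs = ['football market', 'unseen word']

vec = TfidfVectorizer()
X_fit = vec.fit_transform(train_docs)        # learns vocabulary and IDF weights

restored = pickle.loads(pickle.dumps(vec))   # in-memory stand-in for the saved file
X_new = restored.transform(new_docs)         # reuses the learned vocabulary, no refit

print(X_fit.shape, X_new.shape)              # both have the same number of columns
print(X_new[1].nnz)                          # terms outside the vocabulary are ignored
```

Calling `fit_transform` again on new text would build a different vocabulary, and the resulting columns would no longer match what the classifier was trained on.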

**APPLYING PREPROCESS AND VECTORIZATION ON TRAINING SET**

In [9]:
train_df['Text'] = preprocess(train_df)
#train_df.iloc[0]['Text']

In [10]:
X_train, y_train = vectorize(train_df['Text']), train_df['Category']

**APPLYING LOGISTIC REGRESSION WITH CROSS-VALIDATION**

In [11]:
lr = LogisticRegressionCV(
    cv = 5,
    scoring = 'accuracy',
    verbose = 3,
    max_iter = 300,
    n_jobs = -1
)

lr.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   57.9s finished


LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=300, multi_class='auto', n_jobs=-1, penalty='l2',
                     random_state=None, refit=True, scoring='accuracy',
                     solver='lbfgs', tol=0.0001, verbose=3)

In [13]:
with open('saved_model.sav', 'wb') as saved_model:
    pickle.dump(lr, saved_model)
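
An alternative to keeping two separate pickle files is to bundle the vectorizer and the classifier in one scikit-learn `Pipeline`, so they cannot get out of sync. This is a sketch on made-up data, and it swaps in plain `LogisticRegression` for speed; it is not the notebook's approach.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up documents and labels for illustration.
docs = ['stock market rise', 'market share fall', 'bank profit rise',
        'football match win', 'team win cup', 'striker score goal']
labels = ['business', 'business', 'business', 'sport', 'sport', 'sport']

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=300)),
])
pipe.fit(docs, labels)

# One serialized object now carries both the vocabulary and the model weights.
restored = pickle.loads(pickle.dumps(pipe))
queries = ['market profit', 'football goal']
print(restored.predict(queries))
```

The restored pipeline accepts raw (preprocessed) strings directly, so no separate `transform` call is needed at prediction time.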

**LOADING TESTING DATASET**

In [14]:
test_df = pd.read_csv(test_path)
solution_df = pd.read_csv(solution_path)
test_df.iloc[0]['Text']

'qpr keeper day heads for preston queens park rangers keeper chris day is set to join preston on a month s loan.  day has been displaced by the arrival of simon royce  who is in his second month on loan from charlton. qpr have also signed italian generoso rossi. r s manager ian holloway said:  some might say it s a risk as he can t be recalled during that month and simon royce can now be recalled by charlton.  but i have other irons in the fire. i have had a  yes  from a couple of others should i need them.   day s rangers contract expires in the summer. meanwhile  holloway is hoping to complete the signing of middlesbrough defender andy davies - either permanently or again on loan - before saturday s match at ipswich. davies impressed during a recent loan spell at loftus road. holloway is also chasing bristol city midfielder tom doherty.'

In [15]:
print('There is no empty row under Text column:', test_df['Text'].notna().all())

There is no empty row under Text column: True


**PREPROCESS AND VECTORIZE TEST SET**

In [16]:
with open('saved_model.sav', 'rb') as f:
    model = pickle.load(f)
with open('saved_tfidf.sav', 'rb') as f:
    tfidf = pickle.load(f)

In [17]:
test_df['Text'] = preprocess(test_df)
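# Caution: this assumes the sample-solution file holds true labels in the same row order as the test set.
# If it only holds placeholder labels, the scores computed below are not meaningful.
X_test, y_test = tfidf.transform(test_df['Text']), solution_df['Category']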
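
Pairing features and labels by row position only works if both files list the articles in the same order. A safer pattern is to join on a shared key; the sketch below assumes an `ArticleId` column in both files, which the cells above do not show, and uses toy frames with hypothetical names.

```python
import pandas as pd

# Toy stand-ins; the 'ArticleId' column is an assumption about the CSV layout.
test_demo = pd.DataFrame({'ArticleId': [3, 1, 2], 'Text': ['c', 'a', 'b']})
labels_demo = pd.DataFrame({'ArticleId': [1, 2, 3],
                            'Category': ['sport', 'tech', 'business']})

# validate='one_to_one' raises if an id is duplicated on either side.
merged = test_demo.merge(labels_demo, on='ArticleId', how='inner',
                         validate='one_to_one')

# Every test row should have found a label.
assert len(merged) == len(test_demo), 'some test rows have no label'

print(merged)
```

After the join, `merged['Text']` and `merged['Category']` are guaranteed to refer to the same article in each row.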

**EVALUATE**

In [18]:
y_pred = model.predict(X_test)
y_pred

array(['sport', 'tech', 'sport', 'business', 'sport', 'sport', 'politics',
       'politics', 'entertainment', 'business', 'business', 'tech',
       'politics', 'tech', 'entertainment', 'sport', 'politics', 'tech',
       'entertainment', 'entertainment', 'business', 'politics', 'sport',
       'business', 'politics', 'sport', 'business', 'sport', 'sport',
       'business', 'politics', 'tech', 'business', 'business', 'sport',
       'sport', 'sport', 'business', 'entertainment', 'entertainment',
       'tech', 'politics', 'entertainment', 'tech', 'sport', 'tech',
       'entertainment', 'business', 'politics', 'business', 'politics',
       'business', 'business', 'business', 'tech', 'business', 'tech',
       'entertainment', 'sport', 'tech', 'sport', 'entertainment', 'tech',
       'politics', 'entertainment', 'entertainment', 'sport', 'tech',
       'sport', 'sport', 'business', 'sport', 'business', 'politics',
       'tech', 'sport', 'tech', 'tech', 'tech', 'entertainment',
     

In [19]:
model.score(X_test, y_test)

0.19183673469387755

In [20]:
from sklearn.metrics import classification_report

print(classification_report( y_test, y_pred))

               precision    recall  f1-score   support

     business       0.21      0.24      0.23       147
entertainment       0.21      0.16      0.18       147
     politics       0.17      0.16      0.17       147
        sport       0.20      0.23      0.22       147
         tech       0.17      0.16      0.16       147

     accuracy                           0.19       735
    macro avg       0.19      0.19      0.19       735
 weighted avg       0.19      0.19      0.19       735
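
With five classes of 147 articles each, always guessing one class would score about 0.20, so the 0.19 accuracy above is at chance level. One possible explanation is that the reference labels do not correspond to the test articles; the notebook does not confirm this. A way to get a trustworthy estimate is to hold out part of the labeled training data, sketched here on made-up documents with plain `LogisticRegression` swapped in for speed.

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up labeled documents standing in for the training CSV.
docs = ['stock market rise', 'market share fall', 'bank profit rise',
        'company profit fall', 'bank share rise', 'stock price fall',
        'football match win', 'team win cup', 'striker score goal',
        'team lose match', 'cup final goal', 'football team score']
labels = ['business'] * 6 + ['sport'] * 6

# Stratified split keeps the class balance in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=0)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=300))
model.fit(X_tr, y_tr)
model_acc = accuracy_score(y_val, model.predict(X_val))

# Majority-class baseline: always predict the most common training label.
majority = Counter(y_tr).most_common(1)[0][0]
baseline_acc = sum(y == majority for y in y_val) / len(y_val)

print('model accuracy:', model_acc)
print('baseline accuracy:', baseline_acc)
```

A model score close to the baseline on a held-out split would point to a modelling problem; a high held-out score combined with a chance-level score against an external file would instead point to a problem with that file's labels or row order.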

