
# Project - Wikipedia Article Classification

Step 1 - Importing Dataset and necessary libraries

In [1]:
import warnings
warnings.filterwarnings("ignore")

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip3 install wikipedia-api

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wikipedia-api
  Downloading Wikipedia_API-0.5.8-py3-none-any.whl (13 kB)
Installing collected packages: wikipedia-api
Successfully installed wikipedia-api-0.5.8


In [11]:
import wikipediaapi
import pandas as pd
import numpy as np

In [4]:
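# engine='python' treats sep as a regex; the unescaped '$' anchor never matches, so each row lands in one column (split in Step 2).
wiki = pd.read_csv('/content/drive/MyDrive/articleDesc.csv', sep='@$@', engine='python')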
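
A minimal sketch of how the delimiter could instead be matched literally at read time, by escaping it with `re.escape`. The two-row sample below is hypothetical and uses fewer columns than the real file. Rows in the real file appear to carry one more field than the header, which this sketch does not handle, so it is an illustration rather than a drop-in replacement for the cell above.

```python
import io
import re

import pandas as pd

# Hypothetical sample in the same '@$@'-delimited layout (not the real file).
raw = (
    "Article Name@$@Class@$@Topic\n"
    "Salmson_B.9@$@Start@$@\n"
    "Wilbur_E._Colyer@$@Stub@$@['biography']\n"
)

# re.escape escapes the '$' so the pattern matches the three literal characters.
sample = pd.read_csv(io.StringIO(raw), sep=re.escape("@$@"), engine="python")

print(sample.columns.tolist())
print(sample.shape)
```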

Step 2 - Creating a Dataframe for the unstructured data

In [5]:
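# read_csv produced a single column named after the unsplit header line.
header = 'Article Name@$@Vital Article@$@Level@$@Class@$@Importance@$@Topic@$@Wikiproject'

# Split every raw row on the literal '@$@' delimiter.
wikifinal = [value.split("@$@") for value in wiki[header]]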
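
The same split can also be sketched with the vectorized pandas string accessor. The example assumes a hypothetical single-column frame with a shortened header, and assumes every row has exactly as many fields as the header (the real data seems to have one extra field per row, so the column assignment would need adjusting there). `regex=False` is what keeps `'@$@'` from being read as a regular expression.

```python
import pandas as pd

# Hypothetical single-column frame, shaped like the output of the read_csv call.
single_col = pd.DataFrame({
    "Article Name@$@Class@$@Topic": [
        "Salmson_B.9@$@Start@$@",
        "Wilbur_E._Colyer@$@Stub@$@['biography']",
    ]
})
header = single_col.columns[0]

# Split on the literal delimiter and expand the pieces into separate columns.
parts = single_col[header].str.split("@$@", regex=False, expand=True)
parts.columns = header.split("@$@")

print(parts)
```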

In [6]:
wiki_df = pd.DataFrame(wikifinal)
wiki_df

Unnamed: 0,0,1,2,3,4,5,6,7
0,Population_history_of_ancient_Egypt_Archive_8,,,,,,,
1,Sin_Mirar_Atrás,,,,,,,
2,RMWFC,,,,,,,
3,Wilbur_E._Colyer,,,Stub,,"['biography', 'military history', 'united stat...",,
4,Salmson_B.9,,,Start,,,,
...,...,...,...,...,...,...,...,...
6153246,I_Wanna_Hold_You,,,Stub,,[],,
6153247,Setback_(architecture),,,,,"['architecture', 'urban studies and planning']",,
6153248,China_and_the_World_Trade_Organization,,,Start,,['china'],,
6153249,Jan_de_Vries_(athlete),,,Stub,,"['running', 'olympics', 'biography', 'netherla...",,


Exploring Wikipedia API

In [16]:
# Getting a Single Page

wiki_wiki = wikipediaapi.Wikipedia('en')
page_py = wiki_wiki.page('Population_history_of_ancient_Egypt_Archive_8')

page_py

Population_history_of_ancient_Egypt_Archive_8 (id: ??, ns: 0)

In [21]:
# Check whether a wiki page exists

page_py = wiki_wiki.page('Sin_Mirar_Atrás')
print("Page - Exists: %s" % page_py.exists())
# Page - Exists: True

page_missing = wiki_wiki.page('Non_Existing_Page')
print("Page - Exists: %s" % page_missing.exists())
# Page - Exists: False

Page - Exists: True
Page - Exists: False


In [26]:
# Print Page Title and Page Summary

print("Page - Title: %s" % page_py.title)
print("Page - Summary: %s" % page_py.summary[0:50000])

Page - Title: Sin Mirar Atrás (David Bisbal album)
Page - Summary: Sin Mirar Atrás (Without Looking Back) is the fourth studio album recorded by Spanish singer David Bisbal. It was released on October 20, 2009, by Universal Music Spain. It was re-released on July 27, 2010, as Sin Mirar Atrás (24 Horas + Edition).


In [27]:
# Print Page URL

print(page_py.fullurl)
print(page_py.canonicalurl)

https://en.wikipedia.org/wiki/Sin_Mirar_Atr%C3%A1s_(David_Bisbal_album)
https://en.wikipedia.org/wiki/Sin_Mirar_Atr%C3%A1s_(David_Bisbal_album)


In [32]:
# Getting Page Categories

def print_categories(page_py):
    categories = page_py.categories
    for title in sorted(categories.keys()):
        print("%s: %s" % (title, categories[title]))

Step 3 - Data Cleansing and Preprocessing

In [7]:
# Splitting produces one extra, empty column; it is named 'BlankCol' here
wiki_df = pd.DataFrame(wikifinal, columns=['Article Name', 'Vital Article', 'Level', 'Class', 'Importance', 'Topic', 'BlankCol', 'Wikiproject'])
wiki_df.head()

Unnamed: 0,Article Name,Vital Article,Level,Class,Importance,Topic,BlankCol,Wikiproject
0,Population_history_of_ancient_Egypt_Archive_8,,,,,,,
1,Sin_Mirar_Atrás,,,,,,,
2,RMWFC,,,,,,,
3,Wilbur_E._Colyer,,,Stub,,"['biography', 'military history', 'united stat...",,
4,Salmson_B.9,,,Start,,,,


In [8]:
wiki_df = wiki_df.fillna('NA')

# Replace non-word characters with spaces, then trim leading/trailing whitespace.
# \W does not match '_', so underscores in article names are kept.
text_cols = ['Article Name', 'Vital Article', 'Level', 'Class',
             'Importance', 'Topic', 'Wikiproject']
for col in text_cols:
    wiki_df[col] = wiki_df[col].str.replace(r'\W', ' ', regex=True).str.strip()
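
The cleaned names in the next output still contain underscores because `\W` excludes letters, digits and `_`. The sketch below reproduces that behaviour with the standard `re` module on three article names from the data. `clean_with_underscores` is a hypothetical variant, not part of the notebook, showing one way underscores could also be turned into spaces.

```python
import re


def clean(text):
    """Replace every non-word character with a space, then trim the ends."""
    return re.sub(r"\W", " ", text).strip()


def clean_with_underscores(text):
    """Hypothetical variant: also treat '_' as a separator and collapse runs."""
    return re.sub(r"[\W_]+", " ", text).strip()


for name in ["Wilbur_E._Colyer", "Setback_(architecture)", "Sin_Mirar_Atrás"]:
    print(repr(clean(name)), repr(clean_with_underscores(name)))
```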

In [9]:
wiki_df

Unnamed: 0,Article Name,Vital Article,Level,Class,Importance,Topic,BlankCol,Wikiproject
0,Population_history_of_ancient_Egypt_Archive_8,,,,,,,
1,Sin_Mirar_Atrás,,,,,,,
2,RMWFC,,,,,,,
3,Wilbur_E _Colyer,,,Stub,,biography military history united states,,
4,Salmson_B 9,,,Start,,,,
...,...,...,...,...,...,...,...,...
6153246,I_Wanna_Hold_You,,,Stub,,,,
6153247,Setback_ architecture,,,,,architecture urban studies and planning,,
6153248,China_and_the_World_Trade_Organization,,,Start,,china,,
6153249,Jan_de_Vries_ athlete,,,Stub,,running olympics biography netherland...,,


In [10]:
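# Create a binary Final_Class column from Class: 'FA' (featured) maps to 1 and the other listed labels map to 0.
# Labels that are not in the list (for example 'GA', 'List', 'A') are left unchanged.

wiki_df['Final_Class'] = wiki_df['Class'].replace(['FA', 'Good', 'B', 'C', 'Start', 'Stub', 'NA'], [1, 0, 0, 0, 0, 0, 0])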

wiki_df.head()

Unnamed: 0,Article Name,Vital Article,Level,Class,Importance,Topic,BlankCol,Wikiproject,Final_Class
0,Population_history_of_ancient_Egypt_Archive_8,,,,,,,,0
1,Sin_Mirar_Atrás,,,,,,,,0
2,RMWFC,,,,,,,,0
3,Wilbur_E _Colyer,,,Stub,,biography military history united states,,,0
4,Salmson_B 9,,,Start,,,,,0


In [33]:
wiki_df['Final_Class'].value_counts()

0                                            5931459
List                                          187738
GA                                             27103
1                                               5413
A                                                910
                                              ...   
GA    WikiProject International relations          1
B    WikiProject Trains                            1
B    WikiProject History of Science                1
GA    WikiProject Plants                           1
C    WikiProject Technology                        1
Name: Final_Class, Length: 160, dtype: int64
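
The counts above show labels such as `List` and `GA` surviving in `Final_Class`, since `replace` only touches the values it is given. A minimal sketch of one alternative, assuming the intent is "featured versus everything else", is a direct comparison with `'FA'`. The sample labels below are hypothetical stand-ins for the `Class` column.

```python
import pandas as pd

# Hypothetical Class values, including labels the replace() list does not cover.
classes = pd.Series(["FA", "Stub", "GA", "List", "NA", "B    WikiProject Trains"])

# 1 for featured articles, 0 for every other label, with no leftovers.
final_class = (classes == "FA").astype(int)

print(final_class.value_counts())
```

With this form every row ends up as 0 or 1, so the Non-Featured pool would also include the `GA`, `List` and `A` rows that the later `Final_Class == 0` filter currently excludes.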

In [34]:
# Import the label encoder
from sklearn import preprocessing

# LabelEncoder maps each distinct value in a column to an integer code.
label_encoder = preprocessing.LabelEncoder()

# Encode every categorical column; the encoder is refit for each column.
categorical_cols = ['Article Name', 'Vital Article', 'Level', 'Class',
                    'Importance', 'Topic', 'BlankCol', 'Wikiproject']
for col in categorical_cols:
    wiki_df[col] = label_encoder.fit_transform(wiki_df[col])

wiki_df

Unnamed: 0,Article Name,Vital Article,Level,Class,Importance,Topic,BlankCol,Wikiproject,Final_Class
0,4402968,0,13,105,15,11,0,0,0
1,4999811,0,13,105,15,11,0,0,0
2,4517733,0,13,105,15,11,0,0,0
3,5928815,0,13,134,15,214708,0,0,0
4,4806348,0,13,106,15,11,0,0,0
...,...,...,...,...,...,...,...,...,...
6153246,2560970,0,13,134,15,0,0,0,0
6153247,4925138,0,13,105,15,19998,0,0,0
6153248,1218930,0,13,106,15,282440,0,0,0
6153249,2726528,0,13,134,15,475348,0,0,0
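
Because a single `LabelEncoder` is refit for each column, only the last column's mapping remains available afterwards. A small sketch of keeping one encoder per column, so that codes can be translated back to labels, is shown below. The two-column frame and the `encoders` dictionary are hypothetical, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature of two categorical columns.
df = pd.DataFrame({
    "Class": ["Stub", "Start", "FA", "Stub"],
    "Importance": ["NA", "High", "NA", "Low"],
})

# Keep a separate fitted encoder for every column.
encoders = {}
for col in df.columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df)
# Codes follow the sorted order of the labels and can be mapped back.
print(encoders["Class"].inverse_transform(df["Class"]))
```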


In [35]:
wiki_df['BlankCol'].value_counts()

0    6153251
Name: BlankCol, dtype: int64

In [36]:
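# drop() returns a new DataFrame; the result is only displayed here, so wiki_df itself still contains 'BlankCol'.
wiki_df.drop(['BlankCol'], axis=1)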

Unnamed: 0,Article Name,Vital Article,Level,Class,Importance,Topic,Wikiproject,Final_Class
0,4402968,0,13,105,15,11,0,0
1,4999811,0,13,105,15,11,0,0
2,4517733,0,13,105,15,11,0,0
3,5928815,0,13,134,15,214708,0,0
4,4806348,0,13,106,15,11,0,0
...,...,...,...,...,...,...,...,...
6153246,2560970,0,13,134,15,0,0,0
6153247,4925138,0,13,105,15,19998,0,0
6153248,1218930,0,13,106,15,282440,0,0
6153249,2726528,0,13,134,15,475348,0,0


In [37]:
wiki_df['Vital Article'].value_counts()

0    6121036
1      32215
Name: Vital Article, dtype: int64

In [38]:
wiki_df['Level'].value_counts()

13    6121036
5       22197
4        9004
3         897
2          90
1          10
0           2
15          1
14          1
7           1
11          1
10          1
18          1
12          1
8           1
17          1
6           1
20          1
16          1
19          1
9           1
21          1
Name: Level, dtype: int64

In [39]:
wiki_df['Importance'].value_counts()

15    6121051
16      11919
8        3604
24       3269
25       2838
1        2011
3        2008
10       1984
28       1796
13       1421
14        477
18        453
12        153
2         116
7          46
9          37
23         23
33          8
20          8
30          5
26          4
19          3
27          3
32          2
29          2
21          2
5           1
0           1
4           1
22          1
6           1
31          1
11          1
17          1
Name: Importance, dtype: int64

In [40]:
wiki_df['Topic'].value_counts()

11        1886347
137480     211722
512352      74133
459580      59228
14316       54500
           ...   
22443           1
338568          1
517691          1
506508          1
242912          1
Name: Topic, Length: 536804, dtype: int64

In [41]:
wiki_df['Wikiproject'].value_counts()

0     6153241
3           1
4           1
9           1
10          1
2           1
7           1
6           1
8           1
1           1
5           1
Name: Wikiproject, dtype: int64

In [42]:
# Separating the data for analysis
featured_article = wiki_df[wiki_df.Final_Class == 1]
non_featured_article = wiki_df[wiki_df.Final_Class == 0]

print(featured_article.shape)
print(non_featured_article.shape)

(5413, 9)
(5931459, 9)


Step 4 - Creation of New Dataset using Random Sampling

Build a smaller dataset that contains all Featured articles and a random sample of Non-Featured articles.

Number of Featured articles: 5413

Take a random sample of 50000 Non-Featured articles.

In [43]:
non_featured_article_sample = non_featured_article.sample(n=50000)
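
Without a seed, `sample` draws different rows on every run, so the rows shown below and the later scores will vary between runs. A minimal sketch of a reproducible draw, using a hypothetical frame and an arbitrary seed value of 42:

```python
import pandas as pd

# Hypothetical stand-in for the non-featured articles.
non_featured = pd.DataFrame({"article_id": range(1000), "Final_Class": 0})

# Passing the same random_state yields the same rows each time.
sample_a = non_featured.sample(n=50, random_state=42)
sample_b = non_featured.sample(n=50, random_state=42)

print(sample_a.equals(sample_b))
```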

Concatenation of two DataFrames

In [44]:
new_dataset = pd.concat([featured_article,non_featured_article_sample],axis=0)
new_dataset

Unnamed: 0,Article Name,Vital Article,Level,Class,Importance,Topic,BlankCol,Wikiproject,Final_Class
1118,2464884,1,4,87,10,362229,0,0,1
2394,1077827,1,4,87,8,29392,0,0,1
2402,1749711,1,5,87,16,213525,0,0,1
2629,3489384,1,5,87,28,0,0,0,1
3078,1042050,0,13,87,15,231940,0,0,1
...,...,...,...,...,...,...,...,...,...
1539428,2152506,0,13,134,15,166543,0,0,0
4173207,538364,0,13,105,15,485283,0,0,0
4772765,5977550,0,13,106,15,379247,0,0,0
791839,236627,0,13,106,15,0,0,0,0


In [45]:
new_dataset['Final_Class'].value_counts()

0    50000
1     5413
Name: Final_Class, dtype: int64

Step 5 - Splitting the data into Features and Targets

In [46]:
from sklearn.model_selection import train_test_split

# Features: every column except the target; target: Final_Class as integers
X = new_dataset.drop(columns='Final_Class')
Y = new_dataset['Final_Class'].astype('int')

# Split the data into training (70%) and testing (30%) sets
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=2)

In [47]:
print(X.shape,x_train.shape,x_test.shape)

(55413, 8) (38789, 8) (16624, 8)
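
Two points about this split may be worth sketching. `Final_Class` is derived from `Class`, so keeping the encoded `Class` column among the features lets a model read the label almost directly. The classes are also imbalanced, so a plain random split can shift the Featured share between train and test. The example below is a sketch on a hypothetical frame that drops `Class` and passes `stratify` to keep the class ratio equal in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 1000 non-featured rows, 100 featured rows.
df = pd.DataFrame({
    "Topic": range(1100),
    "Class": [0] * 1000 + [1] * 100,
    "Final_Class": [0] * 1000 + [1] * 100,
})

# Exclude the target and the column it was derived from.
X = df.drop(columns=["Final_Class", "Class"])
y = df["Final_Class"]

# stratify=y keeps the featured share the same in train and test.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2, stratify=y
)

print(y_train.mean(), y_test.mean())
```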


Step 6 - Model Building and Training

Model 1 - Logistic Regression Model

In [48]:
from sklearn.linear_model import LogisticRegression

# Build the logistic regression model
model = LogisticRegression()

In [49]:
# Training logistic regression model with Training Data
model.fit(x_train , y_train)

In [50]:
# Evaluate the model on the testing set
lr_acc = model.score(x_test, y_test)
lr_acc

0.8846847930702598
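
With 50000 Non-Featured and 5413 Featured articles in the sampled dataset, always predicting Non-Featured would be right for about 50000 / 55413 ≈ 0.902 of it, so the accuracy above is best read next to a baseline. The sketch below shows how such a comparison could look, using synthetic data from `make_classification` as a stand-in for the notebook's features. The 90/10 class weights and all variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 90% class 0, 10% class 1).
X, y = make_classification(
    n_samples=2000, n_features=5, weights=[0.9, 0.1], random_state=0
)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2, stratify=y
)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(x_train, y_train)
model = LogisticRegression(max_iter=1000).fit(x_train, y_train)

baseline_acc = baseline.score(x_test, y_test)
model_acc = model.score(x_test, y_test)
print("baseline accuracy:", baseline_acc)
print("model accuracy:", model_acc)

# Per-class precision and recall show how the minority class is handled.
print(classification_report(y_test, model.predict(x_test), zero_division=0))
```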

Model 2 - Support Vector Machine (SVM) Model

In [None]:
from sklearn import svm

# Re-split the data into training and testing sets.
# This overwrites the 70/30 split used for logistic regression with an 80/20 split.
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# Create an SVM classifier with a linear kernel
clf = svm.SVC(kernel='linear', C=1.0)

# Train the classifier on the training data
clf.fit(x_train, y_train)

In [None]:
# Evaluate the classifier's accuracy
accuracy = clf.score(x_test, y_test)

accuracy
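
The SVM cells above have no recorded output. `svm.SVC` fit time grows at least quadratically with the number of samples, and unscaled integer codes tend to slow convergence further. One possible alternative, sketched here on synthetic stand-in data, is to standardize the features and use `LinearSVC`. Note that this swaps in a different solver and loss than `SVC(kernel='linear')`, so results are not guaranteed to match. The data and the `max_iter` value are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the eight feature columns.
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale the features, then fit a linear SVM.
clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=5000))
clf.fit(x_train, y_train)

accuracy = clf.score(x_test, y_test)
print(accuracy)
```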

# To answer the post-evaluation questions:

1. Accuracy is the share of test articles whose predicted label matches the actual label; the logistic regression model above scores about 0.885 on its test set. The result depends on the quality of the extracted features and on the choice of classification algorithm. Because the sampled dataset holds 50000 Non-Featured and 5413 Featured articles, always predicting Non-Featured would already be correct for roughly 90% of it, so accuracy should be read against that baseline.

2. SVM is another popular classification algorithm that can be used instead of logistic regression. It looks for the hyperplane that separates the classes with the maximum margin, and it can sometimes outperform logistic regression on complex datasets.

3. The most important features can be identified by analyzing the coefficients of the logistic regression model, or with feature selection techniques such as recursive feature elimination (RFE). A single feature is unlikely to achieve the same accuracy as the full feature set, as the features are likely to be correlated with one another.
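
The coefficient and RFE approaches from point 3 can be sketched as follows. Coefficients are only comparable when the features share a scale, so the example standardizes them first. The data is synthetic and the feature names are placeholders borrowed from the notebook's columns; the ranking it prints says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Placeholder names attached to synthetic features (illustration only).
feature_names = ["Vital Article", "Level", "Importance", "Topic"]
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_scaled = StandardScaler().fit_transform(X)

# Larger absolute coefficients indicate a stronger influence on the prediction.
model = LogisticRegression(max_iter=1000).fit(X_scaled, y)
by_weight = sorted(
    zip(feature_names, model.coef_[0]), key=lambda pair: abs(pair[1]), reverse=True
)
for name, coef in by_weight:
    print(f"{name}: {coef:.3f}")

# RFE removes the weakest feature at each step; rank 1 is the last one kept.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=1)
rfe.fit(X_scaled, y)
ranking = dict(zip(feature_names, rfe.ranking_))
print(ranking)
```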