<a href="https://colab.research.google.com/github/JackMAlucard/Data-Scientist-Technical-Assessment/blob/main/npl-first-approach.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#First Approach to NLP by reproducing simple projects

#Installing and using the Kaggle API to import datasets directly from Kaggle

In [1]:
# Install the Kaggle API
!pip install kaggle



In [2]:
# Upload Kaggle API Credentials, downloaded from https://www.kaggle.com/settings
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"jackmalucard","key":"<redacted>"}'}

In [3]:
# Move Kaggle API Credentials
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
# Set Permissions
!chmod 600 ~/.kaggle/kaggle.json
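
For reference, the three shell steps above can also be written in plain Python. This is only a minimal sketch: `install_kaggle_credentials` is a hypothetical helper name (it is not part of the Kaggle API), and it assumes the uploaded `kaggle.json` sits in the current working directory unless another path is given.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_credentials(src="kaggle.json", dest_dir=None):
    """Move a Kaggle credentials file into place and restrict its permissions.

    Hypothetical helper that mirrors the mkdir / mv / chmod commands.
    """
    dest_dir = Path(dest_dir) if dest_dir else Path.home() / ".kaggle"
    dest_dir.mkdir(parents=True, exist_ok=True)   # mkdir -p ~/.kaggle
    dest = dest_dir / "kaggle.json"
    shutil.move(str(src), str(dest))              # mv kaggle.json ~/.kaggle/
    os.chmod(dest, stat.S_IRUSR | stat.S_IWUSR)   # chmod 600: owner read/write only
    return dest
```

Restricting the file to owner read/write matters because it holds an API key; for the same reason, it is safer not to leave the key visible in a cell output.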

The Kaggle API can now be used to download datasets directly by using the following command:


```
!kaggle datasets download -d dataset_name
```

Where ```dataset_name``` is the identifier of the dataset to be downloaded from Kaggle, in the form ```owner/dataset```, for example ```zynicide/wine-reviews```.
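
To reuse the command for several datasets, it can help to build it programmatically. The sketch below only assembles the argument list and does not call Kaggle. `build_download_command` is a hypothetical helper, and the `-p` (download directory) and `--unzip` options are used as I understand the Kaggle CLI, so check `kaggle datasets download -h` before relying on them.

```python
import re


def build_download_command(dataset, path=None, unzip=False):
    """Return the argument list for `kaggle datasets download` (hypothetical helper)."""
    # A dataset identifier is expected to look like "owner/dataset-name"
    if not re.fullmatch(r"[\w.-]+/[\w.-]+", dataset):
        raise ValueError(f"expected 'owner/dataset-name', got {dataset!r}")
    cmd = ["kaggle", "datasets", "download", "-d", dataset]
    if path is not None:
        cmd += ["-p", str(path)]   # assumed option: target directory
    if unzip:
        cmd.append("--unzip")      # assumed option: extract after downloading
    return cmd


print(" ".join(build_download_command("zynicide/wine-reviews")))
# prints: kaggle datasets download -d zynicide/wine-reviews
```

The returned list can be passed to `subprocess.run` outside a notebook, where the `!` shell syntax is not available.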

#First project: [Wine Reviews](https://www.kaggle.com/datasets/zynicide/wine-reviews)

###Downloading dataset from Kaggle

In [4]:
# Download dataset directly using the Kaggle API
!kaggle datasets download -d zynicide/wine-reviews

Downloading wine-reviews.zip to /content
 94% 48.0M/50.9M [00:02<00:00, 28.4MB/s]
100% 50.9M/50.9M [00:02<00:00, 20.7MB/s]


###Extracting dataset from zip file into the ```/content``` directory.

In [5]:
# Import 'zipfile' module
import zipfile

# Specify the path to the zip file with the dataset
zip_file_path = 'wine-reviews.zip'

# Specify the directory where the files are to be extracted
extract_dir = '/content'

# Extract the files
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)
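
Before extracting everything, it can be useful to look inside the archive and extract only the files of interest. A minimal sketch, assuming a small in-memory archive with made-up member names so that it runs without the real download:

```python
import io
import os
import tempfile
import zipfile

# Build a tiny in-memory archive (stand-in for wine-reviews.zip)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("example-a.csv", "col\n1\n")
    zf.writestr("notes.txt", "hello")

with tempfile.TemporaryDirectory() as tmp, zipfile.ZipFile(buffer, "r") as zf:
    # namelist() returns the member names without extracting anything
    csv_members = [name for name in zf.namelist() if name.endswith(".csv")]
    # extractall() accepts a subset of members to extract
    zf.extractall(tmp, members=csv_members)
    extracted = sorted(os.listdir(tmp))

print(csv_members)  # ['example-a.csv']
print(extracted)    # ['example-a.csv']
```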

Getting the names of the CSV files in the dataset, to be used later when loading the data.

In [6]:
import os

# Specify the directory path
directory_path = '/content'

# Specify the desired file extension
extension = '.csv'

# Get a list of file names in the directory
file_names = os.listdir(directory_path)

# Filter and save filenames with the desired extension
filtered_files = [file_name for file_name in file_names if file_name.endswith(extension)]

# Print the list of file names
print("List of file names in the directory:")
for file_name in filtered_files:
    print(file_name)

List of file names in the directory:
winemag-data-130k-v2.csv
winemag-data_first150k.csv
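
Note that `os.listdir` returns names in arbitrary order, so code that later indexes into the list benefits from sorting it first. A minimal alternative sketch with `pathlib`, run against a temporary directory; `list_csv_files` is a hypothetical helper and `readme.txt` is a made-up placeholder file:

```python
import tempfile
from pathlib import Path


def list_csv_files(directory):
    """Return the CSV file names in `directory`, sorted alphabetically."""
    return sorted(p.name for p in Path(directory).glob("*.csv"))


with tempfile.TemporaryDirectory() as tmp:
    # Create empty stand-in files, deliberately out of order
    for name in ["winemag-data_first150k.csv", "readme.txt", "winemag-data-130k-v2.csv"]:
        (Path(tmp) / name).touch()
    csv_files = list_csv_files(tmp)

print(csv_files)
# ['winemag-data-130k-v2.csv', 'winemag-data_first150k.csv']
```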


###Loading and inspecting data

In [7]:
# Import pandas library
import pandas as pd

In [8]:
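# Load the dataset files into pandas DataFrames.
# Sort the names first, since os.listdir() returns them in arbitrary order
csv_paths = [os.path.join(directory_path, name) for name in sorted(filtered_files)]
df1 = pd.read_csv(csv_paths[0], index_col=0)  # winemag-data-130k-v2.csv
df2 = pd.read_csv(csv_paths[1], index_col=0)  # winemag-data_first150k.csv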

In [9]:
# Dataset #1
# Display the first few rows of the DataFrame
print("First few rows of DataFrame 1:")
print(df1.head())
print("\n")

# Get basic information about the DataFrame
# (info() prints its report directly and returns None, so it is not wrapped in print)
print("DataFrame 1 info:")
df1.info()

First few rows of DataFrame 1:
    country                                        description  \
0     Italy  Aromas include tropical fruit, broom, brimston...   
1  Portugal  This is ripe and fruity, a wine that is smooth...   
2        US  Tart and snappy, the flavors of lime flesh and...   
3        US  Pineapple rind, lemon pith and orange blossom ...   
4        US  Much like the regular bottling from 2012, this...   

                          designation  points  price           province  \
0                        Vulkà Bianco      87    NaN  Sicily & Sardinia   
1                            Avidagos      87   15.0              Douro   
2                                 NaN      87   14.0             Oregon   
3                Reserve Late Harvest      87   13.0           Michigan   
4  Vintner's Reserve Wild Child Block      87   65.0             Oregon   

              region_1           region_2         taster_name  \
0                 Etna                NaN       Kerin O’

In [10]:
# Dataset #2
# Display the first few rows of the DataFrame
print("First few rows of DataFrame 2:")
print(df2.head())
print("\n")

# Get basic information about the DataFrame
# (info() prints its report directly and returns None, so it is not wrapped in print)
print("DataFrame 2 info:")
df2.info()

First few rows of DataFrame 2:
  country                                        description  \
0      US  This tremendous 100% varietal wine hails from ...   
1   Spain  Ripe aromas of fig, blackberry and cassis are ...   
2      US  Mac Watson honors the memory of a wine once ma...   
3      US  This spent 20 months in 30% new French oak, an...   
4  France  This is the top wine from La Bégude, named aft...   

                            designation  points  price        province  \
0                     Martha's Vineyard      96  235.0      California   
1  Carodorum Selección Especial Reserva      96  110.0  Northern Spain   
2         Special Selected Late Harvest      96   90.0      California   
3                               Reserve      96   65.0          Oregon   
4                            La Brûlade      95   66.0        Provence   

            region_1           region_2             variety  \
0        Napa Valley               Napa  Cabernet Sauvignon   
1            

##Loading libraries to be used for analysis

In [11]:
from collections import Counter
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split

##Dataset #1
Analysis reproduced from the article.

In [12]:
# Identifying and filtering the rows in df1 that correspond to the top 10
# most common 'variety' values, creating a new DataFrame df with only these rows
counter = Counter(df1['variety'].tolist())
top_10_varieties = {i[0]: idx for idx, i in enumerate(counter.most_common(10))}
df = df1[df1['variety'].map(lambda x: x in top_10_varieties)]

# Convert the 'description' column of df into a list
description_list = df['description'].tolist()
# Transform the 'variety' values in df into numerical indices
varietal_list = [top_10_varieties[i] for i in df['variety'].tolist()]
varietal_list = np.array(varietal_list)

# Preprocess the descriptions in description_list and
# convert them into a numerical representation: a document-term matrix (DTM)
count_vect = CountVectorizer()
x_train_counts = count_vect.fit_transform(description_list)

# Convert the document-term counts into a TF-IDF weighted matrix
tfidf_transformer = TfidfTransformer()
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)

# Hold out 30% of the rows for testing.
# Note: the vectorizer and transformer above were fitted on all rows, test rows included
train_x, test_x, train_y, test_y = train_test_split(x_train_tfidf, varietal_list, test_size=0.3)

# Train a linear-kernel support vector classifier and predict the test labels
clf = svm.SVC(kernel='linear').fit(train_x, train_y)
y_pred = clf.predict(test_x)

# Accuracy: fraction of test predictions that match the true labels
accuracy = np.mean(y_pred == test_y)
print("Accuracy: %.2f%%" % (accuracy * 100))

Accuracy: 80.24%
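
The label-encoding step at the top of the cell above can be seen in isolation on a toy list. A minimal sketch using only the standard library; the variety list is made-up sample data, and `top_k` is 2 instead of 10 to keep the example short:

```python
from collections import Counter

# Made-up sample labels standing in for df1['variety']
varieties = ["Pinot Noir", "Chardonnay", "Pinot Noir", "Riesling",
             "Chardonnay", "Pinot Noir", "Malbec"]
top_k = 2

# Map each of the top_k most common labels to an integer index (0 = most common)
counter = Counter(varieties)
top_varieties = {name: idx for idx, (name, _count) in enumerate(counter.most_common(top_k))}

# Keep only rows whose label is in the top_k, then encode them as integers
kept = [v for v in varieties if v in top_varieties]
encoded = [top_varieties[v] for v in kept]

print(top_varieties)  # {'Pinot Noir': 0, 'Chardonnay': 1}
print(encoded)        # [0, 1, 0, 1, 0]
```

Rows with any other variety are dropped, so the classifier only ever sees the `top_k` classes.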


In [15]:
# Same preprocessing as the SVC cell above: top 10 varieties, word counts, TF-IDF, 70/30 split
counter = Counter(df1['variety'].tolist())
top_10_varieties = {i[0]: idx for idx, i in enumerate(counter.most_common(10))}
df = df1[df1['variety'].map(lambda x: x in top_10_varieties)]

description_list = df['description'].tolist()
varietal_list = [top_10_varieties[i] for i in df['variety'].tolist()]
varietal_list = np.array(varietal_list)

count_vect = CountVectorizer()
x_train_counts = count_vect.fit_transform(description_list)


tfidf_transformer = TfidfTransformer()
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)

train_x, test_x, train_y, test_y = train_test_split(x_train_tfidf, varietal_list, test_size=0.3)

# Train a Multinomial Naive Bayes classifier and predict the test labels
clf = MultinomialNB().fit(train_x, train_y)
y_pred = clf.predict(test_x)

# Accuracy: fraction of test predictions that match the true labels
accuracy = np.mean(y_pred == test_y)
print("Accuracy: %.2f%%" % (accuracy * 100))

Accuracy: 63.42%
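
Both classifiers above are trained on TF-IDF weights rather than raw counts. To make that weighting concrete, here is a hand-rolled sketch of the formula that, as far as I know, `TfidfTransformer` applies with its default settings: raw term counts times `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The whitespace tokenizer is a simplifying assumption and does not match `CountVectorizer`, and the two documents are made-up examples.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in documents]  # simplified tokenizer
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


vocab, rows = tfidf(["ripe cherry fruit", "ripe citrus"])
print(vocab)  # ['cherry', 'citrus', 'fruit', 'ripe']
```

In the second document, `citrus` ends up with a larger weight than `ripe`, because `ripe` appears in both documents and is therefore less informative.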


My own analysis, based on a mixture from  