
### Neural Machine Translation English Sentences
In this notebook we are going to create a new `.csv` file of the English sentences that we gathered using the `tweepy` API.
___

Topic: `NMT`

Date: `2022/07/30`

Programming Language: `python`

Main: `Natural Language Processing (NLP)`

___

In the following code cell we are going to import all the packages that we are going to use in this notebook.

In [1]:
from google.colab import drive, files
import os
import pandas as pd

### Mounting the Drive

In the following code cell we are going to mount Google Drive.

In [2]:
drive.mount('/content/drive')

Mounted at /content/drive


### Defining the Paths
In the following code cell we are going to define the paths to our files.

In [3]:
base_dir = '/content/drive/My Drive/NLP Data/nmt'

assert os.path.exists(base_dir), f"The path '{base_dir}' does not exist, check if you have mounted the google drive."

scraps_path = os.path.join(base_dir, "datasets")

assert os.path.exists(scraps_path), f"The path '{scraps_path}' does not exist, check if you have mounted the google drive."

english_path = os.path.join(base_dir, "english")

assert os.path.exists(english_path), f"The path '{english_path}' does not exist, check if you have mounted the google drive."

save_path = os.path.join(english_path, "english.csv")
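
The cell above repeats the same existence check three times. As a minimal sketch, the check could be factored into a small helper; `require_dir` is a hypothetical name introduced here for illustration and is not part of the notebook.

```python
import os


def require_dir(path: str) -> str:
    """Return `path` unchanged if it is an existing directory, otherwise raise."""
    if not os.path.isdir(path):
        raise FileNotFoundError(
            f"The path '{path}' does not exist, "
            "check if you have mounted the google drive."
        )
    return path


# Hypothetical usage with the paths defined in this notebook:
# base_dir = require_dir('/content/drive/My Drive/NLP Data/nmt')
# scraps_path = require_dir(os.path.join(base_dir, "datasets"))
# english_path = require_dir(os.path.join(base_dir, "english"))
```

Raising `FileNotFoundError` instead of using `assert` keeps the check active even when Python runs with assertions disabled (`python -O`).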

### Getting all the Text

We are going to get all the text data and store it in a list. After doing that we are going to create an `english.csv` file based on the unique English sentences that we got by scraping the data from Twitter.

In [5]:
all_texts = list()
for topic in os.listdir(scraps_path):
  # each topic folder holds a single csv named after the topic
  df = pd.read_csv(os.path.join(scraps_path, topic, f"{topic}.csv"))
  # drop repeated tweets within this topic
  df.drop_duplicates(subset=["text"], inplace=True)
  texts = list(df.text.values)
  all_texts.extend(texts)
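
Note that the loop removes duplicates inside each topic file only, so a sentence scraped under two topics would appear twice in `all_texts`. The following is a sketch of an alternative that removes duplicates across all topics at once. It assumes the same `<topic>/<topic>.csv` layout and `text` column as above; `load_unique_texts` is a hypothetical helper name, and its result may differ in size from the count reported later in this notebook.

```python
import os

import pandas as pd


def load_unique_texts(scraps_path: str) -> list:
    """Collect the `text` column of every `<topic>/<topic>.csv` and drop repeats."""
    frames = []
    for topic in sorted(os.listdir(scraps_path)):
        csv_path = os.path.join(scraps_path, topic, f"{topic}.csv")
        if not os.path.isfile(csv_path):
            continue  # skip entries that do not contain the expected csv
        frames.append(pd.read_csv(csv_path, usecols=["text"]))
    if not frames:
        return []
    combined = pd.concat(frames, ignore_index=True)
    # Drop missing values, then duplicates across *all* topics (first occurrence kept).
    return combined["text"].dropna().drop_duplicates().tolist()
```

Sorting the topic names makes the order of the resulting list reproducible, since `os.listdir` returns entries in arbitrary order.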

In [6]:
dataframe = pd.DataFrame(all_texts, columns=["sentence"], index=None)
dataframe.head()

| | sentence |
|---|---|
| 0 | so where are we with this eisenhower is farewe... |
| 1 | to you still beautiful even after death i say ... |
| 2 | bid farewell to the final semester of the depa... |
| 3 | a great sadness farewell |
| 4 | to all my family out there that are over life ... |


### Checking how many English sentences we have

In the following code cell we are going to check how many English sentences we have in our dataset.

In [9]:
len(dataframe)

108631

> We have about `109k` English sentences in our dataset, with duplicates removed within each topic file.
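
`len(dataframe)` counts rows, not distinct sentences. As a rough sketch, the two can be compared with `nunique`; `sentence_counts` is a hypothetical helper name, and the toy dataframe in the usage comment is made up for illustration.

```python
import pandas as pd


def sentence_counts(df: pd.DataFrame, column: str = "sentence") -> dict:
    """Summarise how many rows, distinct values and missing values a column has."""
    return {
        "total": len(df),
        "unique": int(df[column].nunique()),    # NaN values are not counted
        "missing": int(df[column].isna().sum()),
    }


# Hypothetical usage on a toy dataframe:
# sentence_counts(pd.DataFrame({"sentence": ["a", "b", "a", None]}))
```

If `total` is larger than `unique + missing`, some sentences occur more than once across topics.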

### Saving our dataframe

In the following code cell we are going to save our dataframe as a `.csv` file:

In [7]:
dataframe.to_csv(save_path)
print("Done!")

Done!
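
By default `to_csv` also writes the dataframe index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `pd.read_csv`. The sketch below illustrates the difference that `index=False` makes; the toy dataframe and the temporary file names are made up for illustration.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"sentence": ["a great sadness farewell", "hello world"]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    toy.to_csv(with_index)                  # default: index written as first column
    toy.to_csv(without_index, index=False)  # only the `sentence` column is written

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'sentence']
print(cols_without_index)  # ['sentence']
```

Whether to pass `index=False` depends on whether later steps expect the extra column, so the original call is left unchanged here.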


### Downloading our File

In the following code cell we are going to download our saved `.csv` file to our local computer:

In [8]:
files.download(save_path)
print("Done!")

Done!
