# Most common words in Portuguese

# 1. Aims, objectives and background

## 1.1. Introduction

Portuguese is often ranked among the ten most spoken languages in the world, with roughly 260 million speakers. Portuguese-speaking countries also have a rich cultural heritage spanning literature, music, cinema, art, and traditions. Learning Portuguese makes it possible to engage more deeply with these cultural expressions, appreciate their nuances, and develop a broader understanding of Lusophone cultures.

Learning the most common words of a language is an effective starting point. A small core vocabulary covers essential concepts used in daily communication, so it provides a foundation for understanding others and expressing oneself in many everyday situations.

In this project, the focus was on analyzing a corpus in Portuguese using the NLTK library. The NLTK (Natural Language Toolkit) library is a widely used open-source library in Python for natural language processing (NLP) tasks. It provides various tools, data sets, and algorithms to facilitate tasks such as tokenization, stemming, part-of-speech tagging, parsing, and more.


## 1.2. Aims and objectives

The goal of this project was to identify the 100 most common words in the Portuguese language. Learning the most frequently used words is widely considered a fundamental step in language acquisition. To aid this process, the identified words were translated into English. The project aims to provide a useful resource for anyone beginning to learn Portuguese, starting with its most commonly used words.

In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import machado, mac_morpho, floresta, genesis
from nltk.text import Text
from nltk.probability import FreqDist

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\casar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\casar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
ptext1 = Text(machado.words('romance/marm05.txt'), name="Memórias Póstumas de Brás Cubas (1881)")
print("ptext1:", ptext1.name)

ptext1: Memórias Póstumas de Brás Cubas (1881)


In [3]:
portuguese_fields = genesis.fileids()


In [4]:
portuguese_fields

['english-kjv.txt',
 'english-web.txt',
 'finnish.txt',
 'french.txt',
 'german.txt',
 'lolcat.txt',
 'portuguese.txt',
 'swedish.txt']

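The analysis combines several corpora shipped with NLTK:
* machado: the collected works of Machado de Assis.
* mac_morpho: a collection of texts extracted from the newspaper *Folha de Sao Paulo*.
* floresta: a collection of sentences analysed (morpho)syntactically.
* genesis: the Book of Genesis in several languages; only `portuguese.txt` is used here.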

In [5]:
# Access the words of each corpus; each call returns a sequence of tokens (words and punctuation)
machado_words = machado.words()
mac_words = mac_morpho.words()
floresta_words = floresta.words()
genesis_words = genesis.words('portuguese.txt')

In [6]:
l_machado = len(machado_words)
l_mac = len(mac_words)
l_floresta = len(floresta_words)
l_genesis = len(genesis_words)

print(f' Machado: {l_machado}  \n Mac: {l_mac} \n Floresta: {l_floresta} \n Genesis: {l_genesis}')

 Machado: 3121944  
 Mac: 1170095 
 Floresta: 211852 
 Genesis: 45094


Join all the lists into a single corpus

In [8]:
total_corpus = machado_words + mac_words + floresta_words + genesis_words
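
The `+` operator joins the word sequences end to end. A minimal sketch of the same idea using only the standard library, assuming toy word lists in place of the real corpora (the names `sources` and `iter_corpus` are illustrative and do not come from the notebook):

```python
from itertools import chain

# Toy stand-ins for the word sequences; the real ones hold millions of tokens.
sources = [
    ["Não", "tive", "filhos"],   # e.g. machado
    ["O", "jornal", "disse"],    # e.g. mac_morpho
    ["casa", "."],               # e.g. floresta
]

def iter_corpus(word_lists):
    """Yield every token from each sequence in order, without copying them."""
    return chain.from_iterable(word_lists)

tokens = list(iter_corpus(sources))
print(len(tokens))  # 8
```

Iterating lazily like this avoids building one large intermediate list, which can matter when the combined corpus has several million tokens.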

Remove special characters and Portuguese stopwords

In [9]:
stopwords = nltk.corpus.stopwords.words('portuguese')
cs = [',', '.', '-', ';', '\x97', '"', '?', ':', '!',  '(', '--', ')', '...', '%', 'sr', '«', "'", '\x93', '}', '{', '\x94.', '/', '[', ']', 'dr']
stopwords.extend(cs)
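
Listing punctuation marks by hand is easy to leave incomplete. As a hedged alternative (not what the notebook does), tokens can be kept only when they consist entirely of letters, using `str.isalpha()`, which accepts accented characters in Python 3. The sample tokens and the helper name `is_word` are made up for illustration. Note that this rule also drops hyphenated forms such as `disse-me`, and abbreviations like `sr` still need an explicit entry.

```python
# Abbreviations that are alphabetic but should still be excluded (taken from the list above).
EXTRA_STOP = {"sr", "dr"}

def is_word(token):
    """True for purely alphabetic tokens that are not excluded abbreviations."""
    return token.isalpha() and token.lower() not in EXTRA_STOP

sample = ["casa", ",", "--", "viúva", "1881", "Sr", "disse-me", "não", "\x97"]
kept = [t for t in sample if is_word(t)]
print(kept)  # ['casa', 'viúva', 'não']
```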

In [10]:
stopwords

['a',
 'à',
 'ao',
 'aos',
 'aquela',
 'aquelas',
 'aquele',
 'aqueles',
 'aquilo',
 'as',
 'às',
 'até',
 'com',
 'como',
 'da',
 'das',
 'de',
 'dela',
 'delas',
 'dele',
 'deles',
 'depois',
 'do',
 'dos',
 'e',
 'é',
 'ela',
 'elas',
 'ele',
 'eles',
 'em',
 'entre',
 'era',
 'eram',
 'éramos',
 'essa',
 'essas',
 'esse',
 'esses',
 'esta',
 'está',
 'estamos',
 'estão',
 'estar',
 'estas',
 'estava',
 'estavam',
 'estávamos',
 'este',
 'esteja',
 'estejam',
 'estejamos',
 'estes',
 'esteve',
 'estive',
 'estivemos',
 'estiver',
 'estivera',
 'estiveram',
 'estivéramos',
 'estiverem',
 'estivermos',
 'estivesse',
 'estivessem',
 'estivéssemos',
 'estou',
 'eu',
 'foi',
 'fomos',
 'for',
 'fora',
 'foram',
 'fôramos',
 'forem',
 'formos',
 'fosse',
 'fossem',
 'fôssemos',
 'fui',
 'há',
 'haja',
 'hajam',
 'hajamos',
 'hão',
 'havemos',
 'haver',
 'hei',
 'houve',
 'houvemos',
 'houver',
 'houvera',
 'houverá',
 'houveram',
 'houvéramos',
 'houverão',
 'houverei',
 'houverem',
 'hou

Count the frequency of each word in the entire corpus

In [13]:
fdist = FreqDist(word.lower() for word in total_corpus if word not in stopwords)

In [14]:
fdist

FreqDist({'o': 18853, 'a': 18683, 'não': 12004, 'disse': 8684, 'e': 6908, 'ainda': 6054, 'casa': 5680, 'em': 5633, 'mas': 5586, 'dia': 5432, ...})
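
In the output above, `a`, `e` and `em` rank among the most frequent entries even though they appear in the stopword list printed earlier. This happens because the membership test uses the original token while the counter stores the lowercased one, so capitalized forms such as `A` or `Em` pass the filter. A small sketch of normalizing first, with `collections.Counter` standing in for `FreqDist` (which subclasses it) and a toy token list and stopword set as assumptions:

```python
from collections import Counter

stop_set = {"a", "e", "em", "o"}          # toy subset of the stopword list
tokens = ["A", "casa", "e", "Em", "casa", "O", "dia", "a"]

# Filter on the original token: capitalized stopwords slip through.
before = Counter(t.lower() for t in tokens if t not in stop_set)

# Lowercase once, then filter and count the same normalized form.
after = Counter(w for w in (t.lower() for t in tokens) if w not in stop_set)

print(before)
print(after)
```

Storing the stopwords in a set also makes each membership test constant time on average, which helps when the check runs once per token over the whole corpus.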

In [16]:
# Convert fdist to a dictionary  
words_freq = dict(fdist)


Create a dataframe with the frequency of each Portuguese word

In [19]:
words = pd.DataFrame.from_dict(words_freq, orient='index').reset_index()
words.rename(columns={0: 'frequência', 'index': 'Palavra'}, inplace = True)
words.sort_values('frequência', ascending=False, inplace=True)

In [20]:
words.head(50)

| | Palavra | frequência |
|---|---|---|
| 30 | o | 18853 |
| 26 | a | 18683 |
| 70 | não | 12004 |
| 514 | disse | 8684 |
| 39 | e | 6908 |
| 569 | ainda | 6054 |
| 366 | casa | 5680 |
| 2401 | em | 5633 |
| 58 | mas | 5586 |
| 345 | dia | 5432 |


Remove digits and special characters

In [21]:
# Use raw strings and regex=True so both patterns are treated as regular expressions
words['Palavra'] = words['Palavra'].str.replace(r'\W', '', regex=True)
words['Palavra'] = words['Palavra'].str.replace(r'\d+', '', regex=True)


In [25]:
import numpy as np 
words.replace('', np.nan, inplace=True)
words.dropna(inplace=True)
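
To make the cleaning step concrete, here is a sketch of the same two substitutions applied with the standard `re` module to a few invented tokens. In Python 3, `\W` is Unicode-aware for `str` patterns, so accented letters survive, while the underscore counts as a word character and is not removed. The function name `clean_token` is illustrative.

```python
import re

def clean_token(token):
    """Strip non-word characters, then digits, mirroring the two replace calls."""
    token = re.sub(r"\W", "", token)
    return re.sub(r"\d+", "", token)

raw = ["não", "d'", "1881", "viúva", "§", "sr."]
cleaned = [clean_token(t) for t in raw]
# Drop tokens that became empty, as the NaN replacement and dropna do in the DataFrame.
cleaned = [t for t in cleaned if t]
print(cleaned)  # ['não', 'd', 'viúva', 'sr']
```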

In [26]:
words.reset_index(inplace=True, drop=True)

Select the most frequent words

In [31]:
common_words = words[words['frequência'] >= 1000]

In [32]:
common_words

| | Palavra | frequência |
|---|---|---|
| 0 | o | 18853 |
| 1 | a | 18683 |
| 2 | não | 12004 |
| 3 | disse | 8684 |
| 4 | e | 6908 |
| ... | ... | ... |
| 274 | vi | 1024 |
| 275 | seguinte | 1015 |
| 276 | viúva | 1010 |
| 277 | ah | 1009 |
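
The threshold of 1000 occurrences keeps 278 rows here, whereas the stated aim is the 100 most common words. A hedged sketch of selecting a fixed number of entries instead, using `Counter.most_common` (a method `FreqDist` inherits from `Counter`). The counts are partly taken from the table above and partly invented (`viela`), and `top_n` is a hypothetical helper:

```python
from collections import Counter

counts = Counter({"não": 12004, "disse": 8684, "casa": 5680,
                  "dia": 5432, "ah": 1009, "viela": 12})

def top_n(freqs, n):
    """Return the n most frequent (word, count) pairs, highest first."""
    return freqs.most_common(n)

# Threshold selection: the number of rows depends on the corpus size.
by_threshold = [(w, c) for w, c in counts.most_common() if c >= 1000]
# Rank selection: always exactly n rows (if that many exist).
by_rank = top_n(counts, 3)

print(len(by_threshold))  # 5
print(by_rank)            # [('não', 12004), ('disse', 8684), ('casa', 5680)]
```

Selecting by rank keeps the size of the word list stable if corpora are added or removed, while a fixed threshold would need to be retuned.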


Translate the most common words to English

In [28]:
from deep_translator import GoogleTranslator

In [33]:
# Copy the slice first so that adding a column does not trigger pandas' SettingWithCopyWarning
common_words = common_words.copy()
common_words['tradução'] = common_words['Palavra'].apply(lambda x: GoogleTranslator(source='portuguese', target='english').translate(x))


In [36]:
common_words.head(15)

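| | Palavra | frequência | tradução |
|---|---|---|---|
| 0 | o | 18853 | O |
| 1 | a | 18683 | The |
| 2 | não | 12004 | no |
| 3 | disse | 8684 | he said |
| 4 | e | 6908 | It is |
| 5 | ainda | 6054 | yet |
| 6 | casa | 5680 | House |
| 7 | em | 5633 | in |
| 8 | mas | 5586 | but |
| 9 | dia | 5432 | day |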


Now that we have the most common words in Portuguese along with their respective translations, we can begin learning them and gain a better understanding of the language.

In [37]:
# Save the dataframe to a CSV file with the appropriate encoding to preserve the accents of the Portuguese language.
common_words.to_csv('common_words_trans.csv',encoding='utf-8-sig')
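
The `utf-8-sig` encoding writes a byte order mark at the start of the file, which helps some spreadsheet programs detect UTF-8 and display accented characters correctly. A small standard-library sketch of a round trip, using a temporary file and three sample rows (the file name `demo.csv` is a placeholder):

```python
import csv
import os
import tempfile

rows = [["Palavra", "frequência", "tradução"],
        ["não", "12004", "no"],
        ["viúva", "1010", "widow"]]

path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

# The file starts with the UTF-8 byte order mark.
with open(path, "rb") as f:
    has_bom = f.read(3) == b"\xef\xbb\xbf"

# Reading with the same encoding strips the mark again.
with open(path, encoding="utf-8-sig", newline="") as f:
    loaded = list(csv.reader(f))

print(has_bom, loaded == rows)  # True True
```

Reading the file back with plain `utf-8` would leave the mark attached to the first header name, so it is safest to use `utf-8-sig` on both sides.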