# Welcome to the Natural Language Processing (NLP) Jupyter notebook series!
This mini-course aims to show the practical side of NLP rather than its detailed theoretical aspects. This and the following notebooks contain a brief theoretical introduction to each concept, which is then implemented using Python and popular Python NLP modules. Each notebook also contains practical exercises along with complete solutions.

Intro to NLP???

read input (1) and process (2)

## 1. Reading files
Natural language processing tasks are usually performed on rather large amounts of data. While copy & paste works fine for small articles or paragraphs, it is not practical when there are thousands of them. Large data sets are stored in files of different formats (e.g. .txt, .csv, .log), each of which imposes rules on how the file is structured. This matters for two reasons when reading files.

First, a given file should be interpreted in the same, unambiguous way by all users, whether they are people or computer programs. Second, the file structure should correspond to its content: e.g. if a file contains user tweets, it is more natural to put one tweet per line rather than one word per line. By exploiting how the file is structured, a lot of pre-processing can be omitted by writing an appropriate file loader.

A common way of structuring files is to use delimiters such as commas ',' or endline characters '\n'. Very large data sets may also be organized using databases and their own file systems.
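
As a minimal sketch of how delimiters carry structure, the snippet below splits a small in-memory string first on endline characters and then on commas. The sample data is made up for illustration and is not one of the course files.

```python
# Hypothetical sample data: two records, one per line, fields separated by commas
raw_text = "cat,noun\nrun,verb\n"

# Split the text into records using the endline delimiter
records = raw_text.splitlines()

# Split each record into fields using the comma delimiter
rows = [record.split(',') for record in records]

print(records)  # ['cat,noun', 'run,verb']
print(rows)     # [['cat', 'noun'], ['run', 'verb']]
```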

### 1.1 Txt files
Let's see how to deal with a text file (words.txt) containing the 10,000 most common English words, one per line. The simplest idea would be to read the whole file and store it as a single string. However, by doing so we do not take advantage of the file structure. At minimal additional cost, instead of one big string we can create a list of 10,000 separate strings, one per word. Let's see how it can be done!

In [5]:
word_list = []

# open 'words.txt' file in a reading 'r' mode.
with open('words.txt', mode='r', encoding='utf-8') as f:
    word_list = f.read().splitlines()  # split lines based on the endline '\n' character

Now, let's look at the first 20 and the last 20 words in our list.

In [6]:
print(word_list[:20]) # First 20 words
print(word_list[-20:]) # Last 20 words

['a', 'aa', 'aaa', 'aaron', 'ab', 'abandoned', 'abc', 'aberdeen', 'abilities', 'ability', 'able', 'aboriginal', 'abortion', 'about', 'above', 'abraham', 'abroad', 'abs', 'absence', 'absent']
['zambia', 'zdnet', 'zealand', 'zen', 'zero', 'zimbabwe', 'zinc', 'zip', 'zoloft', 'zone', 'zones', 'zoning', 'zoo', 'zoom', 'zoophilia', 'zope', 'zshops', 'zu', 'zum', 'zus']


In some cases, we don't want to read the whole file (e.g. when the file is huge). Let's see how to read only the first 50 words of the words.txt file.
Note: a very common mistake when reading files line by line is forgetting to remove the endline '\n' character at the end of each line (see .rstrip() below).

In [7]:
word_list = []
lines_to_read_num = 50

# open 'words.txt' file in a reading 'r' mode.
with open('words.txt', mode='r', encoding='utf-8') as f:
    for _ in range(lines_to_read_num):  # _ is a conventional name for an unused loop variable
        word_list.append(f.readline().rstrip())  # rstrip() removes trailing whitespace, including the endline '\n' character

print(word_list[:20]) # First 20 words

['a', 'aa', 'aaa', 'aaron', 'ab', 'abandoned', 'abc', 'aberdeen', 'abilities', 'ability', 'able', 'aboriginal', 'abortion', 'about', 'above', 'abraham', 'abroad', 'abs', 'absence', 'absent']
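
One caveat with the loop above: if the file has fewer lines than `lines_to_read_num`, `f.readline()` returns an empty string once the end of the file is reached, so the list gets padded with empty strings. A possible alternative is sketched below. It iterates over the file object lazily with `itertools.islice`, which simply stops when the file ends. The helper name `read_first_lines` and the small temporary file standing in for words.txt are assumptions made for this illustration.

```python
import os
import tempfile
from itertools import islice


def read_first_lines(path, n, encoding='utf-8'):
    """Return at most the first n lines of a text file, without the endline character."""
    with open(path, mode='r', encoding=encoding) as f:
        # islice pulls lines lazily and stops early if the file is shorter than n
        return [line.rstrip('\n') for line in islice(f, n)]


# Demo on a temporary three-line file (made-up content standing in for words.txt)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_words.txt')
    with open(demo_path, mode='w', encoding='utf-8') as f:
        f.write('a\naa\naaa\n')

    first_two = read_first_lines(demo_path, 2)
    first_fifty = read_first_lines(demo_path, 50)

print(first_two)    # ['a', 'aa']
print(first_fifty)  # ['a', 'aa', 'aaa']
```

Asking for 50 lines from a three-line file returns just the three lines, with no empty strings appended.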


### 1.2 File Encoding 
Now, let's try to load another text file (japaneseWords.txt) containing the 10,000 most common Japanese words.

In [8]:
japanese_words = []

# open 'japaneseWords.txt' file in a reading 'r' mode.
with open('japaneseWords.txt', mode='r', encoding='utf-8') as f:
    japanese_words = f.read().splitlines()  # split lines based on the endline '\n' character
print(japanese_words[:20]) # First 20 words

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Oops, we got a `UnicodeDecodeError`... This error is raised because the encoding we passed to `open` (`utf-8`) is different from the one used in the Japanese words file (`utf-16`). The byte `0xff` at position 0 is a hint: it is consistent with the start of a UTF-16 little-endian byte order mark (`0xff 0xfe`), and it can never begin a valid UTF-8 sequence. Even though `utf-8` is widely used and sufficient in most cases, in the future you may deal with data stored in other encodings (as in this case). Try to solve this problem by appropriately changing the `encoding` parameter in the `open` function and run the cell again!
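
The mismatch can also be reproduced without the course file. The sketch below writes a tiny UTF-16 file of its own, shows that decoding it as `utf-8` fails, and then reads it back with the matching encoding. The file name and the two sample words are made up for illustration.

```python
import os
import tempfile

sample_words = ['日本', '言葉']  # made-up sample content

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_utf16.txt')

    # Write with UTF-16; Python prepends a byte order mark (BOM) to the file
    with open(demo_path, mode='w', encoding='utf-16') as f:
        f.write('\n'.join(sample_words) + '\n')

    # Reading with the wrong encoding raises UnicodeDecodeError
    try:
        with open(demo_path, mode='r', encoding='utf-8') as f:
            f.read()
        decode_failed = False
    except UnicodeDecodeError:
        decode_failed = True

    # Reading with the matching encoding recovers the original words
    with open(demo_path, mode='r', encoding='utf-16') as f:
        decoded_words = f.read().splitlines()

print(decode_failed)  # True
print(decoded_words)  # ['日本', '言葉']
```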

### 1.3 CSV files
So far, our data set has consisted of one word per line without any additional details. However, in many cases each entry in your data set may contain more than one field. For example, imagine that we would like to annotate each word in the English word list with the name of the lexical category it belongs to (verb, noun, etc.). In cases like this, CSV files are very convenient.
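
To make the idea concrete, here is a minimal sketch using Python's standard `csv` module. The two-column layout (`word,category`) and the sample rows are assumptions made up for illustration, and an in-memory `io.StringIO` buffer stands in for a real file.

```python
import csv
import io

# Hypothetical CSV content: one word per row, annotated with its lexical category
csv_text = "word,category\nability,noun\nabandon,verb\nable,adjective\n"

# io.StringIO lets csv.DictReader treat the string like an open file
with io.StringIO(csv_text) as f:
    reader = csv.DictReader(f)  # uses the first row as field names
    annotated_words = [(row['word'], row['category']) for row in reader]

print(annotated_words)
# [('ability', 'noun'), ('abandon', 'verb'), ('able', 'adjective')]
```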