# Colab configuration

Mount your Google Drive to access the files stored there.

In [1]:
%cd /content/drive/MyDrive/your_path/sqlite3-notes
%ll

[Errno 2] No such file or directory: '/content/drive/MyDrive/your_path/sqlite3-notes'
/content
total 8
drwx------ 4 root 4096 Nov 30 23:24 drive/
drwxr-xr-x 1 root 4096 Nov 13 17:33 sample_data/


Or download the file directly from GitHub:

In [2]:
!curl -O https://raw.githubusercontent.com/G0erman/sqlite3-notes/main/sqlite3.po

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 79525  100 79525    0     0   180k      0 --:--:-- --:--:-- --:--:--  180k


# Exploration

In [3]:
!tail sqlite3.po -n 30

msgstr ""
"Versiones antiguas de SQLite tienen problemas compartiendo conexiones entre "
"hilos. Es por ello que el módulo de Python no permite compartir conexiones y "
"cursores entre hilos. Si se quiere intentar esto, se obtendrá una excepción "
"en tiempo de ejecución."

#: ../Doc/library/sqlite3.rst:1104
msgid ""
"The only exception is calling the :meth:`~Connection.interrupt` method, "
"which only makes sense to call from a different thread."
msgstr ""
"La única excepción es llamando el método :meth:`~Connection.interrupt`, el "
"cual solamente tiene sentido llamarlo desde un hilo diferente."

#: ../Doc/library/sqlite3.rst:1108
msgid "Footnotes"
msgstr "Notas al pie"

#: ../Doc/library/sqlite3.rst:1109
msgid ""
"The sqlite3 module is not built with loadable extension support by default, "
"because some platforms (notably Mac OS X) have SQLite libraries which are "
"compiled without this feature. To get loadable extension support, you must "
"pass --enable-loadable-sqlite-extension

Each entry in the file has the following structure:

* `#: ..(library):line_number` Source file and line number of the original text.
* `msgid` Original text in English.
* `msgstr` Text translated into Spanish.

To do: find a regular expression that parses this structure.
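As a starting point for that to-do, here is a minimal sketch of such a regular expression. It assumes each entry has exactly one `#:` reference line followed by `msgid` and `msgstr`, with no flags, plural forms, or escaped quotes; `SAMPLE`, `ENTRY_RE`, `join_quoted` and `parse_entries` are hypothetical names used only for illustration.

```python
import re

# Hypothetical sample in the same layout as sqlite3.po (not read from the file).
SAMPLE = '''#: ../Doc/library/sqlite3.rst:1108
msgid "Footnotes"
msgstr "Notas al pie"

#: ../Doc/library/sqlite3.rst:1104
msgid ""
"The only exception is calling the :meth:`~Connection.interrupt` method, "
"which only makes sense to call from a different thread."
msgstr ""
"La única excepción es llamando el método :meth:`~Connection.interrupt`, el "
"cual solamente tiene sentido llamarlo desde un hilo diferente."
'''

# One entry: a "#: file:line" comment, then msgid and msgstr, each made of
# one or more quoted strings on consecutive lines.
ENTRY_RE = re.compile(
    r'^#: (?P<source>\S+):(?P<line>\d+)\n'
    r'msgid (?P<msgid>"[^\n]*"(?:\n"[^\n]*")*)\n'
    r'msgstr (?P<msgstr>"[^\n]*"(?:\n"[^\n]*")*)',
    re.MULTILINE,
)

def join_quoted(block):
    """Concatenate the quoted chunks of a msgid/msgstr block into one string."""
    return ''.join(line[1:-1] for line in block.split('\n'))

def parse_entries(text):
    """Return a list of dicts with source, line, msgid and msgstr."""
    return [
        {
            'source': m.group('source'),
            'line': int(m.group('line')),
            'msgid': join_quoted(m.group('msgid')),
            'msgstr': join_quoted(m.group('msgstr')),
        }
        for m in ENTRY_RE.finditer(text)
    ]

entries = parse_entries(SAMPLE)
```

This sketch does not unescape sequences such as `\"` or `\n` and skips entries that do not match the assumed layout, so a dedicated PO parser is likely the safer choice for the full file.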

# Libraries

In [4]:
!pip install polib

Collecting polib
  Downloading https://files.pythonhosted.org/packages/30/a2/e407c3b00cace3d7fc8df14d364deeecfeb96044e1a317de583bc26eae58/polib-1.1.0-py2.py3-none-any.whl
Installing collected packages: polib
Successfully installed polib-1.1.0


In [5]:
# Utils
import re 
from collections import Counter

# Handle PO files with polib
import polib

# Natural Language Processing
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk.download('punkt') # sentence tokenizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Read Data

In [6]:
po = polib.pofile('sqlite3.po')
valid_entries = [entry for entry in po if not entry.obsolete]

In [7]:
english_text = [entry.msgid for entry in valid_entries]
spanish_text = [entry.msgstr for entry in valid_entries]

# Text Processing

In [8]:
english_text[:5]

[':mod:`sqlite3` --- DB-API 2.0 interface for SQLite databases',
 '**Source code:** :source:`Lib/sqlite3/`',
 "SQLite is a C library that provides a lightweight disk-based database that doesn't require a separate server process and allows accessing the database using a nonstandard variant of the SQL query language. Some applications can use SQLite for internal data storage.  It's also possible to prototype an application using SQLite and then port the code to a larger database such as PostgreSQL or Oracle.",
 'The sqlite3 module was written by Gerhard Häring.  It provides a SQL interface compliant with the DB-API 2.0 specification described by :pep:`249`.',
 'To use the module, you must first create a :class:`Connection` object that represents the database.  Here the data will be stored in the :file:`example.db` file::']

## Basic statistics

In [9]:
# Split sentences into words
all_text = [nltk.word_tokenize(sentence) for sentence in english_text]

In [10]:
all_text[:2]

[[':',
  'mod',
  ':',
  '`sqlite3`',
  '--',
  '-',
  'DB-API',
  '2.0',
  'interface',
  'for',
  'SQLite',
  'databases'],
 ['**Source', 'code', ':', '**', ':', 'source', ':', '`Lib/sqlite3/`']]

In [11]:
def clean_word(x):
  """Lowercase a token, replace non-word characters and digits with spaces,
  and return None if fewer than 3 characters remain."""
  word = re.sub(r'\W+', r' ', x.lower()).strip()
  word = re.sub(r'\d+', r' ', word).strip()
  if len(word) < 3:
    return None
  return word

def test_clean_word():
  assert clean_word(':') is None
  assert clean_word('Interface') == 'interface'
  assert clean_word('DB-API') == 'db api'
  assert clean_word('5') is None
  assert clean_word(' ti') is None

test_clean_word()

In [12]:
all_txt_clean = [clean_word(word) for sentence in all_text for word in sentence]
all_txt_clean = [word for word in all_txt_clean if word]
print(f'Total words: {len(all_txt_clean)}')
print(f'Unique words: {len(set(all_txt_clean))}')

Total words: 3884
Unique words: 939


In [13]:
# All Words
result = Counter(all_txt_clean)
result.most_common(15)

[('the', 370),
 ('sqlite', 85),
 ('and', 70),
 ('you', 70),
 ('for', 60),
 ('this', 60),
 ('class', 55),
 ('database', 54),
 ('that', 49),
 ('can', 41),
 ('will', 39),
 ('with', 31),
 ('cursor', 30),
 ('sql', 29),
 ('module', 29)]

In [18]:
# Remove stopwords
en_stopword = stopwords.words('english')
all_txt_clean = [word for word in all_txt_clean if word not in en_stopword]
print(f'Total words: {len(all_txt_clean)}')
print(f'Unique words: {len(set(all_txt_clean))}')

Total words: 2747
Unique words: 866


In [19]:
result = Counter(all_txt_clean)
result.most_common(15)

[('sqlite', 85),
 ('class', 55),
 ('database', 54),
 ('cursor', 30),
 ('sql', 29),
 ('module', 29),
 ('connection', 29),
 ('meth', 29),
 ('method', 29),
 ('python', 28),
 ('types', 24),
 ('name', 23),
 ('parameter', 23),
 ('mod', 22),
 ('use', 22)]
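The list above still contains `meth` and `mod`, which come from reST roles such as `` :meth:`...` `` rather than from the prose itself. A minimal sketch of one way to drop the role names before tokenizing, keeping only the target, could look like this; `ROLE_RE` and `strip_roles` are hypothetical helpers, not part of the notebook.

```python
import re

# Matches reST roles such as :mod:`sqlite3` or :meth:`~Connection.interrupt`.
# The optional "~" prefix is discarded along with the role name.
ROLE_RE = re.compile(r':\w+:`~?([^`]+)`')

def strip_roles(text):
    """Replace each role with its target, dropping the role name."""
    return ROLE_RE.sub(r'\1', text)

cleaned = strip_roles(':mod:`sqlite3` --- DB-API 2.0 interface for SQLite databases')
```

Applying `strip_roles` to every item of `english_text` before `nltk.word_tokenize` should keep role names out of the counts, while target names such as `sqlite3` remain.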

# To do:

* Plot a word cloud
* Lemmatization
* TF-IDF
* Advanced statistics
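For the TF-IDF item, a minimal standard-library sketch is shown below. It assumes each document is a list of cleaned tokens and uses `tf = count / len(doc)` and `idf = log(N / df)`, which is only one of several common variants; `tf_idf` and the toy `docs` are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per document.

    docs: list of token lists.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Hypothetical toy documents, standing in for cleaned msgid token lists.
docs = [
    ['sqlite', 'database', 'cursor'],
    ['sqlite', 'connection', 'connection'],
]
scores = tf_idf(docs)
```

With this variant, a word that appears in every document (here `sqlite`) scores zero, so the ranking favors words that are specific to a few entries.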

In [None]:
def catch_reference(x):
  """Catch references such as :mod:`sqlite3`. These can point to
  modules, libraries, classes, etc.

  parameters:
    x: all text to parse

  returns:
    ref: dictionary mapping each role (e.g. 'mod') to the list of its targets
  """
  ref = {}
  # Each match is a (role, target) pair, e.g. ('mod', 'sqlite3')
  for role, target in re.findall(r':(\w+):`([^`]+)`', x):
    ref.setdefault(role, []).append(target)
  return ref

def test_catch_reference():
  assert catch_reference(':mod:`sqlite3` --- DB-API 2.0 int... databases') == \
  {'mod': ['sqlite3']}

test_catch_reference()