# Word Co-occurrences

This notebook analyses co-occurrences of topics in LGBTIQ datasets and prepares them for visualization, following:

https://rainynotes.net/co-occurrence-matrix-visualization/

Steps:
- Write topics in a text file, one line per id/dataset
- Count Vector Representation
- Turn into matrix

(The co-occurrence matrix network is visualized with Gephi, an open-source tool for visualizing networks.)
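
To make the goal concrete, here is a minimal sketch of what a co-occurrence count is, using only the standard library and no vectorizer. The three sample lines are made up for illustration and are not taken from the dataset; two topics co-occur when they appear on the same line.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample: one line per dataset, topics separated by spaces
lines = [
    "04_health 12_psychology",
    "04_health 16_society_and_culture",
    "04_health 12_psychology 16_society_and_culture",
]

pair_counts = Counter()
for line in lines:
    topics = sorted(set(line.split()))           # unique topics of this dataset
    pair_counts.update(combinations(topics, 2))  # count every unordered pair once

for (a, b), n in sorted(pair_counts.items()):
    print(a, b, n)
```

The matrix built later in the notebook holds the same kind of counts, arranged as a topic-by-topic table.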

More on CountVectorizer:

https://medium.com/swlh/understanding-count-vectorizer-5dd71530c1b 

## Write topics in a text file, one line per id/dataset

In [66]:
import pandas as pd
from slugify import slugify

df = pd.read_csv("keywords_LGBTIQ_datasets.csv", sep = ';', encoding="latin-1")
df = df.rename(columns={'id': 'ident'})

# Read csv into dictionary with ident as keys and topics as values (in a list)
# 2nd answer here: https://stackoverflow.com/questions/56062627/converting-csv-to-dictionary-with-one-column-as-keys-and-one-column-as-its-value 
# astype(str) is used instead of the deprecated DataFrame.applymap(str)
df = df.astype(str).groupby('ident')['topics_level_1'].apply(list).to_dict()
#print(df)

# Iterate over dictionary: https://realpython.com/iterate-through-dictionary-python/
# Convert list to string: https://www.freecodecamp.org/news/python-list-to-string-how-to-convert-lists-in-python/ 
# The file is opened once; one line is written per ident
with open("output.txt", 'w') as f:
    f.write('[')
    for key in df:
        # slugify joins the words of each topic with "-";
        # "-" is then changed to "_" (necessary to be treated as one word later)
        res = [slugify(element).replace("-", "_") for element in df[key]]
        result_string = " ".join(res)  # all topics of one ident on one line
        f.write('"%s",\n' % result_string)
    f.write(']')




## Count Vector Representation

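CountVectorizer transforms a collection of texts into vectors of word counts: it builds a vocabulary from every word that occurs in the entire collection, then represents each text (here, each line of the file) by how often each vocabulary word occurs in it.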

https://www.geeksforgeeks.org/using-countvectorizer-to-extracting-features-from-text/

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

### Tokenize
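The tokenizing behavior can be changed through options of CountVectorizer. The default settings are used here, because each topic was written to the file as a single word joined by underscores.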
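
Why underscores matter can be seen from the default `token_pattern` of CountVectorizer, `(?u)\b\w\w+\b`: it keeps runs of two or more word characters, and the underscore counts as a word character while the hyphen and the space do not. The sketch below applies that same regular expression with Python's `re` module; the two sample strings are made up for illustration.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer
token_pattern = re.compile(r"(?u)\b\w\w+\b")

with_underscores = "04_health 16_society_and_culture"
with_hyphens = "04-health 16-society-and-culture"

# Underscores keep every topic together as one token
print(token_pattern.findall(with_underscores))

# Hyphens split every topic into separate words
print(token_pattern.findall(with_hyphens))
```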

In [67]:
from sklearn.feature_extraction.text import CountVectorizer

f = open("output.txt", "r")

count_vect = CountVectorizer()
count_vect.fit(f)

print(count_vect.vocabulary_)
# --> Unique words along with indices (words are sorted alphabetically; the index is the position in the sorted list)

{'14_social_stratification_and_groupings': 13, '16_society_and_culture': 15, '08_law_crime_and_legal_systems': 7, '04_health': 3, '19_other': 18, '07_labour_and_employment': 6, '11_politics': 10, '05_history': 4, '01_demography_population_vital_statistics_and_censuses': 0, '13_science_and_technology': 12, '03_education': 2, '06_housing_and_land_use': 5, '02_economics': 1, '09_media_communication_and_language': 8, '15_social_welfare_policy_and_systems': 14, '18_transport_and_travel': 17, '17_trade_industry_and_markets': 16, '12_psychology': 11, '10_natural_environment': 9}


### Derive vector

In [70]:
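# Encode the Document
# fit() above already read the file to its end, so rewind before reading it again;
# without the rewind, transform() receives no lines and toarray() prints []
f.seek(0)
vector = count_vect.transform(f)
 
# Summarizing the Encoded Texts
print(vector.toarray())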


## Create matrix

In [None]:
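import pandas as pd

# Topic-by-topic co-occurrence counts from the document-term matrix of the previous cell
Xc = vector.T @ vector
Xc.setdiag(0)  # a topic co-occurring with itself is not informative

names = count_vect.get_feature_names_out()  # These are the entity names (i.e. keywords)
df = pd.DataFrame(data=Xc.toarray(), columns=names, index=names)
df.to_csv('to_gephi.csv', sep=',')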
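
The matrix step can be sketched on a small dense array: multiplying the transposed document-term matrix by itself gives topic-by-topic counts. The array below is a made-up example with 0/1 entries (rows are datasets, columns are topics), and NumPy stands in for the sparse matrix purely for readability.

```python
import numpy as np

# Hypothetical document-term matrix: 3 datasets (rows) x 3 topics (columns)
X = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
])

cooc = X.T @ X             # entry (i, j): number of datasets containing both topic i and topic j
np.fill_diagonal(cooc, 0)  # drop the co-occurrence of a topic with itself

print(cooc)
```

With 0/1 entries, each off-diagonal value is the number of datasets that share the two topics; if a topic can appear more than once per line, the products of the counts are summed instead.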

In [None]:
# OLD CODE
'''
import pandas as pd

df = pd.read_csv("keywords_LGBTIQ_datasets.csv", sep = ';', encoding="latin-1")
df = df.rename(columns={'id': 'ident'})

# Read csv into dictionary with ident as keys and topics as values (in a list)
# 2nd answer here: https://stackoverflow.com/questions/56062627/converting-csv-to-dictionary-with-one-column-as-keys-and-one-column-as-its-value 
df = df.applymap(str).groupby('ident')['topics_level_1'].apply(list).to_dict()
#print(df)

# Iterate over dictionary: https://realpython.com/iterate-through-dictionary-python/
# Convert list to string: https://www.freecodecamp.org/news/python-list-to-string-how-to-convert-lists-in-python/ 
for key in df:
    #print(df[key])
    with open("output.txt", 'w') as f:
        for key in df:
            string_list = ['"{0}"'.format(element) for element in df[key]]  # turns list into string, but keeps " "
            delimiter = " "
            result_string = delimiter.join(string_list)
            f.write('"%s",\n' % (result_string))  
'''

In [None]:
# OLD CODE
'''
from sklearn.feature_extraction.text import CountVectorizer
import re

f = open("output.txt", "r")

tokenizer = lambda s: [token.strip() for token in re.split(r'(?= )', s.strip())]
# Defines how expressions are separated using REGEX: Split at _"
# https://stackoverflow.com/questions/43067373/split-by-comma-and-how-to-exclude-comma-from-quotes-in-split/43067487#43067487
# https://docs.python.org/2/library/re.html#re.split

count_vect = CountVectorizer(tokenizer=tokenizer)
count_vect.fit(f)

print(count_vect.vocabulary_)

# Unique words along with indices (= words sorted alphabetically, index is the place in the sorted list)
'''

In [None]:
# OLD CODE
'''
# sklearn countvectorizer
from sklearn.feature_extraction.text import CountVectorizer

f = open("output.txt", "r")

# Convert a collection of text documents to a matrix of token counts
cv = CountVectorizer(ngram_range=(1,1), stop_words = 'english')

# matrix of token counts
X = cv.fit_transform(f)
Xc = (X.T * X) # matrix manipulation
Xc.setdiag(0) # set the diagonals to be zeroes as it's pointless to be 1
'''