# Before you start

Before you get started with the program, I will briefly outline what the program contains and what it requires.

The program is written in Python, as you have probably guessed by now, and it uses Pandas, an open-source library for working with tabular data.

The program was written in Jupyter Notebook, which only runs the cells you ask it to run. This means that you do not have to run every cell and can skip the ones you do not find relevant. However, you must run the cells that import Pandas and make the data frame, because the later cells depend on them. If you skip them, the later cells will fail with an error such as a NameError for pd or df.

What can the program be used for:

The program is intended as a tool for organizing and navigating qualitative data in the form of transcribed interviews. However, it is entirely up to you to use the program for whatever you want.

What does the program contain:
- Two dummy texts to test run the program
- A data frame to organize your data
- Facilities for editing and adding to the data frame
- A list to categorize keywords
- Masks to navigate in your data frame

Dummy Text:
There are two premade dummy texts in the folder. You therefore have the opportunity to test the program and get to know it before using it for your own data. 
There are two dummy texts so you have the opportunity to work with them in the same data frame, as data does not necessarily come in one file. The two dummy texts are called dummy_text_1.csv and dummy_text_2.csv 

# Import Pandas

In [None]:
#Import Pandas. If you installed Python through Anaconda, Pandas is already included and you only need the import line below.
#If you do not already have Pandas installed, you need to install it first (for example with: pip install pandas).
import pandas as pd 
#To make it easier for ourselves, we import Pandas as pd. 
#That is, we only need to write pd when we call the Pandas library

print(pd.__version__)
#We do this to check the Pandas version.
#The program is made with Pandas version 1.0.5.
#If you do not have the same version, there may be parts of the program that do not work optimally.
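
If you want the notebook to tell you when your version differs from the one the program was made with, a small check can be sketched as follows. The names TESTED_VERSION and major are our own and only used for illustration; the check compares major version numbers only and is not a guarantee that everything works.

```python
import pandas as pd

# The version the program was made with, as stated in the cell above.
TESTED_VERSION = "1.0.5"

def major(version):
    """Return the major version number, e.g. '1.0.5' -> 1."""
    return int(version.split(".")[0])

# True if the installed Pandas has the same major version as the tested one.
same_major = major(pd.__version__) == major(TESTED_VERSION)

if not same_major:
    print("Installed Pandas", pd.__version__, "differs from tested version", TESTED_VERSION)
```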

# Make the data frame

In [7]:
#We will now make two data frames which we will put together into one.

#First step:
#The data that should be in our first data frame, df1, comes from the file dummy_text_1.csv. 
#We call this data data1 to indicate that it comes from dummy text 1 and is in our df1.

data1 = pd.read_csv('dummy_text_1.csv', encoding='latin', sep=';') 

#We tell the program that data1 must be read from the dummy_text_1.csv file. 
#And that the encoding is 'latin' (Latin-1), as this is the character encoding the csv file is saved in. 
#This is not always necessary and depends on the content of the individual file. 
#In addition, we tell the program that the separator is ';' (semicolon) so the program knows how to split the text into columns.

#Second step:
#The data frame we create must contain three main categories: Time, Speaker and Quotes

df1 = pd.DataFrame(data1, columns = ['Time','Speaker','Quotes'])


# Third step:
# We repeat the first two steps for the second file, dummy_text_2.csv, to create data2 and df2.

data2 = pd.read_csv('dummy_text_2.csv', encoding='latin', sep=';')

df2 = pd.DataFrame(data2, columns = ['Time','Speaker','Quotes'])

# Fourth step:
# It is now time to combine the two data frames, df1 and df2, into one data frame called df. 
# To do this we use pd.concat(), which places the two data frames one after the other.
# (The older df1.append(df2) did the same, but append() was removed from Pandas in version 2.0.)
df = pd.concat([df1, df2])
# That is, df2 starts where df1 ends. This can be solved in many ways, but for our data this solution works.

# Fifth step:
#The fifth and final step in creating our final data frame is to reset the row index. 
#Since we have put two data frames together, the row numbers 0, 1, 2 and so on each appear twice. 
df = df.reset_index()
#By resetting the row index, we ensure that only one row has, for example, number 0.
#The old row numbers are kept in a new column called index.

#By simply typing df and then running the cell, one can see the final data frame.
df

Unnamed: 0,index,Time,Speaker,Quotes
0,0,00:00:00,Speaker 1,"""Lorem ipsum dolor sit amet, consectetuer adip..."
1,1,00:00:02,Speaker 2,"""Cum sociis natoque penatibus et magnis dis pa..."
2,2,00:00:07,Speaker 1,"""Donec quam felis, ultricies nec, pellentesque..."
3,3,00:00:13,Speaker 2,"""In enim justo, rhoncus ut, imperdiet a, venen..."
4,4,00:00:14,Speaker 1,"""Nullam dictum felis eu pede mollis pretium."""
5,5,00:00:25,Speaker 2,"""Integer tincidunt. Cras dapibus."""
6,6,00:03:04,Speaker 1,"""Ivamus elementum semper nisi. Aenean vulputat..."
7,7,00:06:20,Speaker 2,"""Aliquam lorem ante, dapibus in, viverra quis,..."
8,8,00:23:05,Speaker 1,"""Etiam ultricies nisi vel augue. Curabitur ull..."
9,9,00:30:22,Speaker 2,"""Etiam rhoncus. Maecenas tempus, tellus eget c..."
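
The steps above can also be tried without any files. The following is a minimal sketch, assuming two small made-up texts (csv_1 and csv_2 are our own stand-ins, not the dummy files) that use the same semicolon-separated layout. Here we pass ignore_index=True to pd.concat(), which renumbers the rows directly, so the old row numbers are not kept in an extra column.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two semicolon-separated csv files.
csv_1 = "Time;Speaker;Quotes\n00:00:00;Speaker 1;First quote\n00:00:02;Speaker 2;Second quote\n"
csv_2 = "Time;Speaker;Quotes\n00:00:01;Speaker 3;Third quote\n"

# io.StringIO lets read_csv read from a string as if it were a file.
part_1 = pd.read_csv(io.StringIO(csv_1), sep=";")
part_2 = pd.read_csv(io.StringIO(csv_2), sep=";")

# ignore_index=True numbers the combined rows 0, 1, 2, ... in one go.
combined = pd.concat([part_1, part_2], ignore_index=True)
print(combined)
```

Whether you want to keep the old row numbers (as in the cell above) or drop them (as here) depends on whether you need to trace a row back to its position in the original file.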


# Add new column

In [8]:
# In the following, we will insert a new column in the existing data frame. 
# As an example, we will insert the column with the heading Topic.
df = df.reindex(columns = df.columns.tolist() + ['Topic'])
# We use reindex() and give it the list of columns that df already has plus the new column name. The new column is filled with empty (NaN) values.

#By typing df.head() and then running the cell, one can see the new column added to the data frame and the first 5 rows.
df.head()

Unnamed: 0,index,Time,Speaker,Quotes,Topic
0,0,00:00:00,Speaker 1,"""Lorem ipsum dolor sit amet, consectetuer adip...",
1,1,00:00:02,Speaker 2,"""Cum sociis natoque penatibus et magnis dis pa...",
2,2,00:00:07,Speaker 1,"""Donec quam felis, ultricies nec, pellentesque...",
3,3,00:00:13,Speaker 2,"""In enim justo, rhoncus ut, imperdiet a, venen...",
4,4,00:00:14,Speaker 1,"""Nullam dictum felis eu pede mollis pretium.""",
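
A new column can also be added by plain assignment. The following is a minimal sketch on a small made-up data frame (example is our own name, not part of the program). It fills the column with None instead of NaN, but Pandas treats both as missing values.

```python
import pandas as pd

# A small made-up data frame used only for illustration.
example = pd.DataFrame({
    "Speaker": ["Speaker 1", "Speaker 2"],
    "Quotes": ["First quote", "Second quote"],
})

# Assigning to a column name that does not exist yet creates the column.
example["Topic"] = None

print(example.columns.tolist())
```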


# Add value to cell in column

In [9]:
# In the following, we will insert text in the new column we created with the heading Topic. 
# As an example, we insert words that could then act as keywords for the current row in relation to the category.

# We use loc that accesses a group of rows and columns by label(s). 
# As an example, we write 0. This indicates that this is the row with the index 0 we want to access. 
# Then we write the name of the column we want to access, in this case the column Topic.
# The last thing we do is write what we want to insert into that cell. 
# An example might be YouTube, as it could be a keyword that describes the topic of the quote
df.loc[0, 'Topic'] = 'YouTube'


#If we want to add a keyword to another cell, we repeat the process, but replace the row index.
df.loc[1, 'Topic'] = 'Media usage'

#By typing df.head() and then running the cell, one can see the new keywords added to the data frame.
df.head()

Unnamed: 0,index,Time,Speaker,Quotes,Topic
0,0,00:00:00,Speaker 1,"""Lorem ipsum dolor sit amet, consectetuer adip...",YouTube
1,1,00:00:02,Speaker 2,"""Cum sociis natoque penatibus et magnis dis pa...",Media usage
2,2,00:00:07,Speaker 1,"""Donec quam felis, ultricies nec, pellentesque...",
3,3,00:00:13,Speaker 2,"""In enim justo, rhoncus ut, imperdiet a, venen...",
4,4,00:00:14,Speaker 1,"""Nullam dictum felis eu pede mollis pretium.""",
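
If the same keyword belongs in several rows, loc can also take a list of row labels or a condition instead of a single row index. The following is a minimal sketch on a small made-up data frame (example is our own name); the keywords are only examples.

```python
import pandas as pd

# A small made-up data frame used only for illustration.
example = pd.DataFrame({
    "Speaker": ["Speaker 1", "Speaker 2", "Speaker 1"],
    "Quotes": ["First quote", "Second quote", "Third quote"],
    "Topic": [None, None, None],
})

# A list of row labels sets the same keyword in rows 0 and 2.
example.loc[[0, 2], "Topic"] = "YouTube"

# A condition sets the keyword in every row where the condition is true.
example.loc[example["Speaker"] == "Speaker 2", "Topic"] = "Media usage"

print(example)
```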


# Category lists with keywords

In [10]:
#If you prefer to organize your possible keywords in a list, this is also an option. 
#As an example, we create a function where you can add a keyword to a list through an input.
#We call our list Movies as an example

Movies = []

#The number of lists can suit your needs depending on how many categories are relevant for your data.
#We call our function add_keyword_to_movies_list and read the keyword from what the user types in.
#If the keyword is already in the Movies list, the function will print out a message; otherwise the keyword will be added. 

def add_keyword_to_movies_list():
    # input() already returns text, so we only remove surrounding spaces and make it lowercase.
    keyword = input().strip().lower()
    if keyword in Movies:
        print('The keyword is already in the list: Movies')
    else:
        Movies.append(keyword)
        print('The keywords in the Movies category are ', Movies)
        
#After a keyword has been added, the function prints the list so you can see every keyword it contains.

In [4]:
# When you call the function you will be able to type in your keyword and then it will be saved in the list.
add_keyword_to_movies_list()

enim
The keywords in the Movies category are  ['enim']


In [5]:
add_keyword_to_movies_list()

pretium
The keywords in the Movies category are  ['enim', 'pretium']
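
If you need several categories, writing one function per list quickly becomes repetitive. One possible variation, sketched below, keeps all lists in a dictionary and takes the category and keyword as arguments instead of using input(). The names categories and add_keyword, and the category Music, are our own and only used for illustration.

```python
# One list of keywords per category, stored in a dictionary.
categories = {"Movies": [], "Music": []}

def add_keyword(category, keyword):
    """Add keyword to the category list; return False if it was already there."""
    keyword = keyword.strip().lower()
    # setdefault creates an empty list if the category does not exist yet.
    keywords = categories.setdefault(category, [])
    if keyword in keywords:
        return False
    keywords.append(keyword)
    return True

add_keyword("Movies", "Enim")
add_keyword("Movies", "pretium")
add_keyword("Movies", "enim")  # already in the list, so it is not added again
print(categories["Movies"])
```

Because the function takes arguments, it can also be called from a loop, for example to add many keywords at once.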


# Masks

In [11]:
# In the following, we will work with different masks to navigate the data frame.

# The first mask means that you only see the rows where a particular speaker speaks.  
view_speaker_1 = df["Speaker"]=="Speaker 1"
# We tell the program that in the data frame we are only interested in seeing the rows where the column Speaker is equal to Speaker 1.

# This mask can be used on all columns and rows to navigate and sort in the information displayed from the data frame.
# By typing the following we apply the mask to the dataframe
df[view_speaker_1]

Unnamed: 0,index,Time,Speaker,Quotes,Topic
0,0,00:00:00,Speaker 1,"""Lorem ipsum dolor sit amet, consectetuer adip...",YouTube
2,2,00:00:07,Speaker 1,"""Donec quam felis, ultricies nec, pellentesque...",
4,4,00:00:14,Speaker 1,"""Nullam dictum felis eu pede mollis pretium.""",
6,6,00:03:04,Speaker 1,"""Ivamus elementum semper nisi. Aenean vulputat...",
8,8,00:23:05,Speaker 1,"""Etiam ultricies nisi vel augue. Curabitur ull...",
10,10,00:45:33,Speaker 1,"""Maecenas nec odio et ante tincidunt tempus. D...",
12,12,01:43:05,Speaker 1,"""Sed consequat, leo eget bibendum sodales, aug...",
14,14,03:00:00,Speaker 1,"""Nam pretium turpis et arcu.Duis arcu tortor, ...",
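
A mask is a column of True and False values, so masks can be stored, reversed and combined. The following is a minimal sketch on a small made-up data frame (example and the mask names are our own): ~ reverses a mask, & requires both masks to be true, and | requires at least one to be true.

```python
import pandas as pd

# A small made-up data frame used only for illustration.
example = pd.DataFrame({
    "Speaker": ["Speaker 1", "Speaker 2", "Speaker 3", "Speaker 1"],
    "Topic": ["YouTube", "Media usage", None, None],
})

is_speaker_1 = example["Speaker"] == "Speaker 1"
has_topic = example["Topic"].notna()

# Rows where someone other than Speaker 1 speaks.
not_speaker_1 = example[~is_speaker_1]

# Rows where Speaker 1 speaks and a topic has been filled in.
speaker_1_with_topic = example[is_speaker_1 & has_topic]

# Rows where Speaker 1 speaks or a topic has been filled in.
speaker_1_or_topic = example[is_speaker_1 | has_topic]

print(speaker_1_with_topic)
```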


In [14]:
# In the next mask different conditions must be met in the data frame before it is printed.
# As an example we want to print out the rows where the quotes contain the word 'et' and the speaker is Speaker 3.

contain_quotes_with_word = df[(df["Quotes"].str.contains(" et ")) & (df["Speaker"]== "Speaker 3")] 
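# We use a space before and after the word we want to search for because the word should be 'et' and not just contain the letters in that order.
# Note that this does not find 'et' at the very start or end of a quote, or directly next to punctuation such as 'et,'.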
contain_quotes_with_word

Unnamed: 0,index,Time,Speaker,Quotes,Topic
15,0,00:00:01,Speaker 3,"Lorem ipsum dolor sit amet, consectetur adipis...",
23,8,00:01:40,Speaker 3,Orci varius natoque penatibus et magnis dis pa...,
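
Searching for the word with spaces around it is simple, but it can miss some quotes. A possible alternative, sketched below on made-up quotes (example is our own name), is a regular expression with word boundaries, written \b. We also pass case=False to ignore upper and lower case and na=False so that empty cells count as no match; both are our own choices for this sketch.

```python
import pandas as pd

# Made-up quotes used only for illustration; the last cell is empty.
example = pd.DataFrame({
    "Quotes": [
        "Et tu, Brute?",
        "natoque penatibus et magnis",
        "sit amet, consectetur",
        "magnis et.",
        None,
    ]
})

# \b marks a word boundary, so 'et' matches as a whole word but not inside 'amet'.
whole_word = example["Quotes"].str.contains(r"\bet\b", case=False, regex=True, na=False)

# The search with spaces, for comparison.
spaced = example["Quotes"].str.contains(" et ", regex=False, na=False)

print(example[whole_word])
```

In this sketch the whole-word search finds three quotes, while the search with spaces only finds the second one.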
