Analysis of lexical semantic network growth in children from different socio-economic backgrounds

This project was originally my term project for a computational linguistics course at Pitt. It was turned into a research project later and I am working on publishing the work.

Man Ho Wong (m.wong@pitt.edu), University of Pittsburgh.
April 24, 2022

This project investigates the relationship between early vocabulary development in children from different socio-economic backgrounds and their mothers' child-directed speech (CDS). Lexical semantic networks for the child speech (CS) and the CDS were constructed from individual files in a dataset collected from CHILDES (see Data sources).

For the original project plan, please see project_plan.md.
progress_report.md documents the development of this project.
progress_presentation.pdf summarizes the progress at the end of the spring semester 2022.

The final report submitted to the course LING1340/2340 is available at reports/final_report.md.

The guestbook for the project can be found here.

1 Repo directory

./
 |---code/                           # code for data processing/analysis
 |   |---etc/
 |   |   |---PyLangAcq_notes.ipynb
 |   |   |---pittchat.py
 |   |
 |   |---data_curation.ipynb
 |   |---data_preprocessing.ipynb
 |   |---exploratory_analysis.ipynb
 |   |---pylangacq_license.txt
 |   |---vocabulary_analysis.ipynb
 |
 |---data/                           # processed and unprocessed data
 |   |---data_samples/               # data samples
 |
 |---reports/                        # reports and presentation
 |   |---images/                     # images used in the final report
 |   |---final_report.md
 |   |---progress_report.md
 |   |---progress_presentation.pdf
 |
 |---.gitignore
 |---LICENSE.md
 |---project_plan.md
 |---README.md                       # YOU ARE HERE

2 Data processing and analysis

The following notebooks form the pipeline for data processing and analysis. Each one generates the data required by the next, so they should be run in the order listed:

data_curation.ipynb (nbviewer) curates the CHILDES datasets needed for this project.
data_preprocessing.ipynb (nbviewer) integrates the curated datasets and cleans the data before analysis.
exploratory_analysis.ipynb (nbviewer) explores what kinds of linguistic analysis can be done with the curated data.
vocabulary_analysis.ipynb (nbviewer) examines the characteristics of semantic networks in children from different SES groups.
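
If you prefer to run the whole pipeline unattended rather than opening each notebook, the order above can be sketched as a small script. This is only a sketch and is not part of the repo: it assumes `jupyter nbconvert` is installed, that the notebooks live in `code/` as shown in the repo directory, and that each notebook runs without manual input. The function names `build_commands` and `run_pipeline` are made up for illustration.

```python
import subprocess
from pathlib import Path

# Notebook order taken from the list above.
PIPELINE = [
    "data_curation.ipynb",
    "data_preprocessing.ipynb",
    "exploratory_analysis.ipynb",
    "vocabulary_analysis.ipynb",
]


def build_commands(code_dir="code"):
    """Return one `jupyter nbconvert` command per notebook, in pipeline order."""
    return [
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace",
         str(Path(code_dir) / name)]
        for name in PIPELINE
    ]


def run_pipeline(code_dir="code"):
    """Execute the notebooks one after another; stop at the first failure."""
    for command in build_commands(code_dir):
        # check=True raises CalledProcessError if a notebook fails,
        # so later notebooks never run on missing or stale data.
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` from the repository root would execute the four notebooks in sequence and save the outputs back into the notebook files.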

3 Running the code

The code is written in Python 3.9.7. For easy sharing, scripts are organized into Jupyter notebooks (see above).

Viewing: You can view the notebooks either on GitHub or on nbviewer.org.

Running: To run the code, you will need a Jupyter Notebook interface. You can also run the code on Google Colab.

Below is a list of the required libraries and packages that are not included in the Python Standard Library, along with the versions tested in this project:

Gensim (4.1.2)
Matplotlib (3.4.3)
NumPy (1.20.3)
Pandas (1.3.4)
PyLangAcq (0.16.0)
NLTK (3.6.5)
NetworkX (2.6.3)
scikit-learn (0.24.2)
tqdm (4.62.3) (optional, for showing a progress bar while the code runs)
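
To see how your environment compares with the versions listed above, something like the following may help. It is a minimal sketch, not part of the repo: it assumes Python 3.8 or later (for `importlib.metadata`) and that the packages were installed under their usual PyPI distribution names. The helper name `compare_versions` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above; keys are PyPI distribution names.
TESTED_VERSIONS = {
    "gensim": "4.1.2",
    "matplotlib": "3.4.3",
    "numpy": "1.20.3",
    "pandas": "1.3.4",
    "pylangacq": "0.16.0",
    "nltk": "3.6.5",
    "networkx": "2.6.3",
    "scikit-learn": "0.24.2",
    "tqdm": "4.62.3",  # optional
}


def compare_versions(tested=None):
    """Map each package to (tested version, installed version or None)."""
    if tested is None:
        tested = TESTED_VERSIONS
    report = {}
    for name, tested_version in tested.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed
        report[name] = (tested_version, installed)
    return report


for name, (tested_version, installed) in compare_versions().items():
    status = "same" if installed == tested_version else "differs"
    print(f"{name}: tested {tested_version}, installed {installed} ({status})")
```

A different version does not necessarily mean the code will fail; the list only records what the project was tested with.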

4 About

Data sources

The corpus data used in this project was downloaded from the CHILDES database:

MacWhinney, B. (2000). The CHILDES Project: Tools for analyzing talk. Third Edition. Mahwah, NJ: Lawrence Erlbaum Associates.

See this page for more information.

This project also used data containing semantic vectors from ConceptNet Numberbatch 19.08, by Luminoso Technologies, Inc. You may redistribute or modify the data under a compatible Share-Alike license.
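
This README does not spell out how the semantic vectors are turned into a network, so the following is only an illustration of one common approach: link two words when the cosine similarity of their vectors reaches a threshold. The threshold value, the function names, and the three-dimensional toy vectors are all made up for this sketch; real Numberbatch vectors are much longer, and the notebooks may define edges differently.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_network(vectors, threshold=0.5):
    """Link every pair of words whose cosine similarity is >= threshold.

    Returns an adjacency dict: word -> set of neighboring words.
    The threshold is an assumption chosen for illustration.
    """
    network = {word: set() for word in vectors}
    for w1, w2 in combinations(vectors, 2):
        if cosine(vectors[w1], vectors[w2]) >= threshold:
            network[w1].add(w2)
            network[w2].add(w1)
    return network


# Toy vectors invented for this example; they are not Numberbatch data.
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.2, 0.1],
    "milk": [0.1, 0.9, 0.2],
}
network = build_network(toy_vectors, threshold=0.5)
print(network)
```

With these toy vectors only "dog" and "cat" end up connected, and "milk" stays isolated. The same word pairs could be passed to `networkx.Graph.add_edges_from` if you want to compute network measures with NetworkX.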

Python package `PyLangAcq`

The following Python package was used in this project for processing CHAT files:

Lee, Jackson L., Ross Burkholder, Gallagher B. Flinn, and Emily R. Coppess. 2016. Working with CHAT transcripts in Python. Technical report TR-2016-02, Department of Computer Science, University of Chicago.
GitHub repo: https://github.com/jacksonllee/pylangacq

The package is licensed under the MIT License. See pylangacq_license.txt for more information.

Licenses

The non-code parts of the project are licensed under Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0). See LICENSE-non_code.md for more information.
The rest of the project is licensed under GNU General Public License Version 3 (GPLv3). See LICENSE.md for more information.

Acknowledgment

I would like to thank my instructors and fellow students in the course Data Science for Linguists for their help and valuable input. I would also like to give special thanks to Prof. Na-Rae Han for helping me review the course Introduction to Computational Linguistics, which I missed last semester due to other commitments. Both courses helped me develop the computational thinking needed to work with large linguistic datasets.
